Apache Arrow / ARROW-2515

Errors with DictionaryArray inside of ListArray or other DictionaryArray

    Details

      Description

      An exception ("KeyError: 26") is raised when .as_py() is called on an element of a ListArray whose values are a DictionaryArray, or on an element of a DictionaryArray whose dictionary is itself a DictionaryArray. Here are two tests that currently fail:

       

      import pyarrow as pa
      
      def test_dictionary_array_1():
          dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
          list_arr = pa.ListArray.from_arrays([0, 2, 3], dict_arr)
          assert list_arr.to_pylist() == [['a', 'b'], ['a']]
      
      def test_dictionary_array_2():
          dict_arr = pa.DictionaryArray.from_arrays([0, 1, 0], ['a', 'b'])
          dict_arr2 = pa.DictionaryArray.from_arrays([0, 1, 2, 1, 0], dict_arr)
          assert dict_arr2.to_pylist() == ['a', 'b', 'a', 'b', 'a']
      

      The problem appears to be that the function box_scalar in scalar.pxi does not handle dictionary arrays, because there is currently no DictionaryValue type.

      DictionaryArray.__getitem__ currently works around the lack of a DictionaryValue type by dereferencing the index and constructing a scalar from the value in the underlying dictionary. In other words, if we have a dictionary with int8 indices and string values, the result of __getitem__ is a StringValue (rather than a DictionaryValue). This works in simple cases, but not in the more complex scenarios illustrated above, where the elements are not reached through DictionaryArray.__getitem__.
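
The failure mode described above can be modeled without pyarrow. The following is a minimal pure-Python sketch, assuming that box_scalar selects its boxing logic from a table keyed by type id and that 26 is the id of the dictionary type (which would explain the "KeyError: 26"). The names SCALAR_BOXERS, the TYPE_* constants other than the value 26, and all helper functions are hypothetical stand-ins, not pyarrow internals.

```python
# Toy model of the reported failure. Only the value 26 comes from the report;
# the other type ids are illustrative placeholders.
TYPE_STRING = 13
TYPE_LIST = 23
TYPE_DICTIONARY = 26


def string_array(values):
    return {"type": TYPE_STRING, "values": list(values)}


def dictionary_array(indices, dictionary):
    return {"type": TYPE_DICTIONARY, "indices": list(indices),
            "dictionary": dictionary}


def list_array(offsets, values):
    return {"type": TYPE_LIST, "offsets": list(offsets), "values": values}


def box_string(arr, i):
    return arr["values"][i]


def box_list(arr, i):
    # Elements of a list are boxed through box_scalar, not through the
    # dictionary-specific workaround below.
    start, stop = arr["offsets"][i], arr["offsets"][i + 1]
    return [box_scalar(arr["values"], j) for j in range(start, stop)]


# Assumed dispatch table: note that there is no entry for TYPE_DICTIONARY.
SCALAR_BOXERS = {TYPE_STRING: box_string, TYPE_LIST: box_list}


def box_scalar(arr, i):
    return SCALAR_BOXERS[arr["type"]](arr, i)


def dictionary_getitem(arr, i):
    # Model of the special-cased workaround: dereference the index and box
    # the corresponding entry of the underlying dictionary.
    return box_scalar(arr["dictionary"], arr["indices"][i])
```

In this model, dictionary_getitem on a top-level dictionary array of strings succeeds, while box_scalar on a list array whose values are a dictionary array raises KeyError(26), as does dictionary_getitem when the dictionary is itself a dictionary array.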

      I have a patch ready that adds a DictionaryValue type similar to the other scalar types, resolving these bugs and removing the need for a special-cased implementation of DictionaryArray.__getitem__. This DictionaryValue would expose two accessor properties, "indices_value" and "dictionary_value", giving access to both the index into the dictionary and the looked-up value. DictionaryValue.as_py() would then simply call .as_py() on the underlying dictionary_value.
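
The shape of the proposed scalar can be illustrated with a small pure-Python sketch. This is not the patch itself: ToyArray, ToyDictionaryArray, PlainValue, and ToyDictionaryValue are hypothetical classes, and only the property names indices_value and dictionary_value and the behavior of as_py() are taken from the description above.

```python
class PlainValue:
    """Stand-in for an ordinary scalar such as a string or integer value."""

    def __init__(self, value):
        self._value = value

    def as_py(self):
        return self._value


class ToyArray:
    """Stand-in for a non-dictionary array."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def box(self, i):
        return PlainValue(self.values[i])


class ToyDictionaryValue:
    """Sketch of a dictionary scalar with the two proposed accessors."""

    def __init__(self, array, i):
        self._array = array
        self._i = i

    @property
    def indices_value(self):
        # Scalar holding the index into the dictionary.
        return self._array.indices.box(self._i)

    @property
    def dictionary_value(self):
        # Scalar holding the looked-up value; it may itself be a
        # ToyDictionaryValue when the dictionary is a dictionary array.
        return self._array.dictionary.box(self.indices_value.as_py())

    def as_py(self):
        return self.dictionary_value.as_py()


class ToyDictionaryArray:
    """Stand-in for a dictionary array; boxing needs no special case."""

    def __init__(self, indices, dictionary):
        self.indices = ToyArray(indices)
        self.dictionary = dictionary

    def __len__(self):
        return len(self.indices)

    def box(self, i):
        return ToyDictionaryValue(self, i)

    def to_pylist(self):
        return [self.box(i).as_py() for i in range(len(self))]
```

Because every array type boxes its own elements uniformly in this sketch, the nested case from the second failing test resolves by recursion: as_py() on the outer value delegates to the inner dictionary value, which delegates to the plain value.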

              People

              • Assignee: Brent Kerby (bkerby)
              • Reporter: Brent Kerby (bkerby)
                  Time Tracking

                  Estimated: 1h
                  Remaining: 0h
                  Logged: 2h