Apache Arrow / ARROW-7591

[Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array

    Details

      Description

      Currently, the to_numpy method doesn't return an ndarray in case of dictionary-typed data:

      In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

      In [55]: a
      Out[55]: 
      <pyarrow.lib.DictionaryArray object at 0x7f5c63d98f28>
      
      -- dictionary:
        [
          "a",
          "b"
        ]
      -- indices:
        [
          0,
          1,
          0
        ]
      
      In [57]: a.to_numpy(zero_copy_only=False)
      Out[57]: 
      {'indices': array([0, 1, 0], dtype=int8),
       'dictionary': array(['a', 'b'], dtype=object),
       'ordered': False}
      

      This dict is just an internal representation that is passed from C++ to Python so that a pd.Categorical / CategoricalBlock can be constructed on the Python side; it is not something we should return to the user as such. Rather, I think we should return a decoded / dense numpy array (or at least raise an error instead of returning this dict).

      (also, if the user wants those parts, they are already available from the dictionary array as a.indices, a.dictionary and a.type.ordered)
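
      To make "decoded / dense" concrete, here is a minimal NumPy-only sketch that expands the parts shown in the output above into a plain array. The helper name decode_dictionary is hypothetical and not part of pyarrow, and the sketch assumes the indices contain no nulls.

```python
import numpy as np


def decode_dictionary(indices, dictionary):
    """Hypothetical helper (not a pyarrow API): expand dictionary-encoded
    parts into a dense array. Assumes there are no null indices."""
    indices = np.asarray(indices)
    dictionary = np.asarray(dictionary, dtype=object)
    # Integer-array indexing looks up every index in the dictionary.
    return dictionary[indices]


# The same parts that to_numpy currently returns in the report above.
parts = {
    "indices": np.array([0, 1, 0], dtype=np.int8),
    "dictionary": np.array(["a", "b"], dtype=object),
    "ordered": False,
}

dense = decode_dictionary(parts["indices"], parts["dictionary"])
print(dense)  # ['a' 'b' 'a']
```

      The result has one element per index, with dtype object, which is roughly what a user would expect from to_numpy here. Note that the "ordered" flag has no place in a dense array and is dropped.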

              People

              • Assignee:
                jorisvandenbossche Joris Van den Bossche
                Reporter:
                jorisvandenbossche Joris Van den Bossche

                  Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 0.5h