Apache Arrow / ARROW-7591

[Python] DictionaryArray.to_numpy returns dict of parts instead of numpy array

Details

    Description

      Currently, the to_numpy method doesn't return an ndarray in the case of dictionary-typed data:

      In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

      In [55]: a
      Out[55]: 
      <pyarrow.lib.DictionaryArray object at 0x7f5c63d98f28>
      
      -- dictionary:
        [
          "a",
          "b"
        ]
      -- indices:
        [
          0,
          1,
          0
        ]
      
      In [57]: a.to_numpy(zero_copy_only=False)
      Out[57]: 
      {'indices': array([0, 1, 0], dtype=int8),
       'dictionary': array(['a', 'b'], dtype=object),
       'ordered': False}
      

      This dict is only an internal representation that the C++ layer passes to Python so that a pd.Categorical / CategoricalBlock can be constructed on the Python side; it is not something we should return to the user as such. Rather, I think we should return a decoded / dense numpy array (or at least raise an error instead of returning this dict).

      (Also, if the user wants those parts, they are already available from the dictionary array as a.indices, a.dictionary and a.type.ordered.)

            People

              jorisvandenbossche Joris Van den Bossche

                Time Tracking

                  Original Estimate - Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 0.5h