Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Currently, the to_numpy method doesn't return an ndarray in case of dictionary type data:
In [54]: a = pa.array(pd.Categorical(["a", "b", "a"]))

In [55]: a
Out[55]:
<pyarrow.lib.DictionaryArray object at 0x7f5c63d98f28>

-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

In [57]: a.to_numpy(zero_copy_only=False)
Out[57]:
{'indices': array([0, 1, 0], dtype=int8),
 'dictionary': array(['a', 'b'], dtype=object),
 'ordered': False}
This dict is just an internal representation that is passed from C++ to Python so that a pd.Categorical / CategoricalBlock can be constructed on the Python side; it is not something we should return as such to the user. Rather, I think we should return a decoded / dense numpy array (or at least raise an error instead of returning this dict).
(Also, if the user wants those parts, they are already available from the dictionary array as a.indices, a.dictionary and a.type.ordered.)
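To illustrate what a "decoded / dense" result would look like, here is a minimal NumPy-only sketch that expands the dict shown in Out[57] into a plain ndarray. The helper name decode_dictionary is hypothetical, and treating a negative index as a missing value is an assumption borrowed from pandas categorical codes; this is not necessarily how the actual fix is implemented.

```python
import numpy as np


def decode_dictionary(encoded):
    """Expand an {'indices', 'dictionary', ...} mapping into a dense object array.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    indices = np.asarray(encoded["indices"])
    dictionary = np.asarray(encoded["dictionary"], dtype=object)

    # Assumption: a negative index marks a missing value (as pandas codes do).
    missing = indices < 0

    # With an empty dictionary there is nothing to look up, so every slot
    # can only be missing.
    if dictionary.size == 0:
        return np.full(indices.shape, None, dtype=object)

    # Point missing slots at position 0 so the lookup cannot go out of range,
    # then overwrite those slots with None afterwards.
    safe_indices = np.where(missing, 0, indices)
    dense = dictionary[safe_indices]  # integer-array indexing returns a copy
    dense[missing] = None
    return dense


# The same structure that to_numpy(zero_copy_only=False) returned above.
encoded = {
    "indices": np.array([0, 1, 0], dtype=np.int8),
    "dictionary": np.array(["a", "b"], dtype=object),
    "ordered": False,
}
dense = decode_dictionary(encoded)
print(dense)
```

The result has one entry per index rather than one per distinct value, which is the shape a caller of to_numpy would expect. The 'ordered' flag is dropped on purpose, since a plain ndarray has no place to store it.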