[ARROW-7168] [Python] pa.array() doesn't respect specified dictionary type - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.15.1
Fix Version/s: 0.16.0
Component/s: C++, Python
Labels:
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/23469

Description

This might be related to ARROW-6548 and others dealing with all NaN columns. When creating a dictionary array, even when fully specifying the desired type, this type is not respected when the data contains only NaNs:

import pandas as pd
import pyarrow as pa

# This may look a little artificial, but it easily occurs when processing
# categorical data in batches and a particular batch contains only NaNs
ser = pd.Series([None, None]).astype('object').astype('category')
typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
pa.array(ser, type=typ).type

results in

>> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)

which means that one cannot e.g. serialize batches of categoricals if the possibility of all-NaN batches exists, even when trying to enforce that each batch has the same schema (because the schema is not respected).

I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case?

In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?

Issue Links

links to

GitHub Pull Request #5866

People

Assignee: Joris Van den Bossche
Reporter: Thomas Buhrmann
Votes: 0
Watchers: 4

Dates

Created: 14/Nov/19 12:21
Updated: 11/Jan/23 07:51
Resolved: 21/Nov/19 10:18

Time Tracking

Estimated: Not Specified
Remaining:
Logged: 1h 20m