Apache Arrow / ARROW-7168

[Python] pa.array() doesn't respect specified dictionary type

Details

    Description

      This might be related to ARROW-6548 and other issues dealing with all-NaN columns. When creating a dictionary array, the requested type is not respected if the data contains only NaNs, even when that type is fully specified:

      import pandas as pd
      import pyarrow as pa

      # This may look a little artificial, but it easily occurs when processing categorical data in batches and a particular batch contains only NaNs
      ser = pd.Series([None, None]).astype('object').astype('category')
      typ = pa.dictionary(index_type=pa.int8(), value_type=pa.string(), ordered=False)
      pa.array(ser, type=typ).type
      

      results in

      >> DictionaryType(dictionary<values=null, indices=int8, ordered=0>)
      

      which means that one cannot, for example, serialize batches of categoricals if all-NaN batches are possible, even when trying to enforce the same schema for every batch (because the specified type is not respected).

      I understand that inferring the type in this case would be difficult, but I'd imagine that a fully specified type should be respected in this case?

      In the meantime, is there a workaround to manually create a dictionary array of the desired type containing only NaNs?

            People

              jorisvandenbossche Joris Van den Bossche
              buhrmann Thomas Buhrmann
              Votes: 0
              Watchers: 4

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1h 20m