Apache Arrow / ARROW-1658

[Python] Out-of-bounds dictionary indices cause segfault after converting to pandas

    Details

      Description

      Minimal reproduction:

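      import numpy as np
      import pandas as pd
      import pyarrow as pa

      num = 100
      # Indices run 0..99, but the dictionary has a single entry,
      # so every index except 0 is out of bounds.
      arr = pa.DictionaryArray.from_arrays(
          np.arange(0, num),
          np.array(['a'], dtype=object),
          mask=np.zeros(num, dtype=bool),  # all False: no entry is null
          ordered=True)

      print(arr.to_pandas())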
      

      Nowhere in the Arrow codebase do we validate that dictionary indices are in bounds, and pandas appears to trust the indices it is given without checking them. We should add a method somewhere to validate that the non-null dictionary indices are in bounds (perhaps in CategoricalBlock::WriteIndices).
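      As a rough illustration of the kind of check proposed here, the sketch below scans the non-null indices and reports any that fall outside the dictionary. It is written in plain Python for clarity, not as the C++ change itself; `find_out_of_bounds` and `validate_dictionary_indices` are hypothetical names and are not part of pyarrow or pandas. The mask follows the convention of the reproduction above (True marks a null entry).

```python
def find_out_of_bounds(indices, dictionary_size, mask=None):
    """Return positions whose non-null index lies outside [0, dictionary_size).

    Hypothetical helper for illustration only. `mask[pos]` being True
    means the entry is null, so its index value is never dereferenced
    and is skipped here.
    """
    bad_positions = []
    for pos, idx in enumerate(indices):
        if mask is not None and mask[pos]:
            continue
        if idx < 0 or idx >= dictionary_size:
            bad_positions.append(pos)
    return bad_positions


def validate_dictionary_indices(indices, dictionary_size, mask=None):
    """Raise IndexError if any non-null index is out of bounds."""
    bad_positions = find_out_of_bounds(indices, dictionary_size, mask)
    if bad_positions:
        first = bad_positions[0]
        raise IndexError(
            "dictionary index %d at position %d is out of bounds for a "
            "dictionary of size %d" % (indices[first], first, dictionary_size)
        )


# Mirrors the reproduction on a small scale: one dictionary entry,
# indices 0..4, no nulls. Positions 1 through 4 are invalid.
print(find_out_of_bounds(range(5), 1))
```

      Running such a check before handing the indices to pandas would turn the segfault into an ordinary exception. A real implementation would presumably be vectorized or done in C++ rather than looping in Python.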

      As an aside: when doing analytics on categorical data, external data may sometimes contain out-of-bounds index values. We should plan for these cases and decide whether to raise an exception or treat the values as null.
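      The two options mentioned above (raise, or treat as null) could be exposed as a policy switch. The following is a minimal sketch under that assumption; the function name `resolve_out_of_bounds` and the `on_invalid` parameter are invented for illustration and do not correspond to any existing Arrow or pandas API.

```python
def resolve_out_of_bounds(indices, dictionary_size, mask=None, on_invalid="raise"):
    """Apply a policy to out-of-bounds dictionary indices.

    Hypothetical helper for illustration only.
    on_invalid="raise": raise IndexError on the first bad index.
    on_invalid="null":  mark the entry as null (mask True) and replace
                        the index with 0 as a harmless placeholder.
    Returns new (indices, mask) lists; the inputs are not modified.
    """
    if on_invalid not in ("raise", "null"):
        raise ValueError("on_invalid must be 'raise' or 'null'")

    new_indices = list(indices)
    new_mask = list(mask) if mask is not None else [False] * len(new_indices)

    for pos, idx in enumerate(new_indices):
        if new_mask[pos]:
            continue  # already null; the stored index is irrelevant
        if 0 <= idx < dictionary_size:
            continue  # valid index
        if on_invalid == "raise":
            raise IndexError(
                "index %d at position %d is out of bounds" % (idx, pos)
            )
        new_mask[pos] = True
        new_indices[pos] = 0

    return new_indices, new_mask


# With a two-entry dictionary, index 3 is invalid and becomes null.
fixed_indices, fixed_mask = resolve_out_of_bounds([0, 3, 1], 2, on_invalid="null")
print(fixed_indices, fixed_mask)
```

      Raising is the safer default because it surfaces corrupt data immediately; converting to null is more forgiving for externally produced data but can hide upstream bugs.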

              People

              • Assignee: Wes McKinney (wesmckinn)
              • Reporter: Wes McKinney (wesmckinn)