Apache Arrow / ARROW-1658

[Python] Out of bounds dictionary indices causes segfault after converting to pandas


Details

    Description

      Minimal reproduction:

      import numpy as np
      import pandas as pd
      import pyarrow as pa
       
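      num = 100
      # Indices 0..99 into a one-entry dictionary: every index except 0 is out of bounds.
      arr = pa.DictionaryArray.from_arrays(
          np.arange(0, num),
          np.array(['a'], object),  # np.object was removed in NumPy 1.24
          np.zeros(num, bool),      # all-False mask: no index is null
          True)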
      
      print(arr.to_pandas())
      

      Nowhere in the Arrow codebase do we validate that dictionary indices are in bounds, and pandas appears to trust the indices it is given without checking them. We should add a method somewhere that validates that the non-null dictionary indices are in bounds (perhaps in CategoricalBlock::WriteIndices).
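
      A minimal sketch of such a check, written in pure Python for illustration rather than as the C++ change suggested above. The name check_dictionary_indices and its arguments are hypothetical and not part of pyarrow; the mask follows the reproduction's convention, where True marks a null index.

```python
def check_dictionary_indices(indices, dictionary_length, mask=None):
    """Raise IndexError if any non-null index falls outside the dictionary.

    Hypothetical helper for illustration; not a pyarrow API.
    """
    for position, index in enumerate(indices):
        # Null slots are skipped: their index value is never dereferenced.
        if mask is not None and mask[position]:
            continue
        if index < 0 or index >= dictionary_length:
            raise IndexError(
                f"dictionary index {index} at position {position} is out of "
                f"bounds for a dictionary of length {dictionary_length}"
            )


# Mirrors the reproduction: 100 indices, one dictionary entry, no nulls.
try:
    check_dictionary_indices(range(100), 1, [False] * 100)
    message = None
except IndexError as exc:
    message = str(exc)
```

      Run against the shape of the reproduction, the check fails at the first index that exceeds the single dictionary entry instead of letting the conversion proceed.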

      As an aside: when doing analytics on categorical data, there may be other cases where external data contains out-of-bounds index values. We should plan for these and decide whether to raise an exception or treat them as null.
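
      To make the two options concrete, here is a small sketch in pure Python. The function resolve_out_of_bounds and its policy argument are hypothetical names chosen for illustration; nothing like this is claimed to exist in Arrow or pandas.

```python
def resolve_out_of_bounds(indices, dictionary_length, policy="raise"):
    """Return (indices, mask), handling out-of-bounds entries per policy.

    policy="raise": fail on the first out-of-bounds index.
    policy="null":  mark out-of-bounds slots as null (mask value True).
    Hypothetical helper for illustration only.
    """
    if policy not in ("raise", "null"):
        raise ValueError(f"unknown policy: {policy!r}")
    cleaned, mask = [], []
    for index in indices:
        in_bounds = 0 <= index < dictionary_length
        if not in_bounds and policy == "raise":
            raise IndexError(f"dictionary index {index} is out of bounds")
        # A masked slot keeps a placeholder of 0; readers must consult the mask.
        cleaned.append(index if in_bounds else 0)
        mask.append(not in_bounds)
    return cleaned, mask


cleaned, mask = resolve_out_of_bounds([0, 3, 1], 2, policy="null")
```

      Raising surfaces corrupt input early, while the null policy keeps a pipeline running at the cost of silently discarding the bad values.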

            People

              wesm Wes McKinney