Apache Arrow / ARROW-5713

[Python] fancy indexing on pa.array

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++, Python
    • Labels:
      None

      Description

      In NumPy, one can index an array with an array of integer positions:

      In [2]: import numpy as np
      In [3]: a = np.array(['a', 'bb', 'ccc', 'dddd'], dtype="O")
      In [4]: indices = np.array([0, -1, 2, 2, 0, 3])
      In [5]: a[indices]
      Out[5]: array(['a', 'dddd', 'ccc', 'ccc', 'a', 'dddd'], dtype=object)
      

      It would be nice to have a similar feature in pyarrow.

      Currently, pa.Array.__getitem__ accepts only a slice or a single integer index as its argument.

      Workarounds built on those two operations are possible, for example:

      In [6]: import pyarrow as pa
      In [7]: a = pa.array(['a', 'bb', 'ccc', 'dddd'])
      In [8]: pa.array(a.to_pandas()[indices])  # if len(indices) is high
      Out[8]:
      <pyarrow.lib.StringArray object at 0x91bd845e8>
      [
        "a",
        "dddd",
        "ccc",
        "ccc",
        "a",
        "dddd"
      ]

      In [9]: pa.array([a[i].as_py() for i in indices])  # if len(indices) is low
      Out[9]:
      <pyarrow.lib.StringArray object at 0x91bc14868>
      [
        "a",
        "dddd",
        "ccc",
        "ccc",
        "a",
        "dddd"
      ]
      

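      Neither workaround is efficient in memory or CPU: the first converts the whole array out of Arrow and back, and the second creates one Python object per index.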

       

              People

              • Assignee: Unassigned
              • Reporter: Artem KOZHEVNIKOV (ArtemK)
              • Votes: 0
              • Watchers: 3
