  1. Apache Arrow
  2. ARROW-3806

[Python] When converting nested types to pandas, use tuples

Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.11.1
    • Fix Version/s: None
    • Component/s: Python
    • Environment: Fedora 29, pyarrow installed with conda

    Description

      When converting to pandas, convert nested types (e.g. list) to tuples. Columns with lists are difficult to query. Here are a few unsuccessful attempts:

      >>> mini
          CHROM    POS           ID            REF    ALTS  QUAL
      80     20  63521  rs191905748              G     [A]   100
      81     20  63541  rs117322527              C     [A]   100
      82     20  63548  rs541129280              G    [GT]   100
      83     20  63553  rs536661806              T     [C]   100
      84     20  63555  rs553463231              T     [C]   100
      85     20  63559  rs138359120              C     [A]   100
      86     20  63586  rs545178789              T     [G]   100
      87     20  63636  rs374311122              G     [A]   100
      88     20  63696  rs149160003              A     [G]   100
      89     20  63698  rs544072005              A     [C]   100
      90     20  63729  rs181483669              G     [A]   100
      91     20  63733   rs75670495              C     [T]   100
      92     20  63799    rs1418258              C     [T]   100
      93     20  63808   rs76004960              G     [C]   100
      94     20  63813  rs532151719              G     [A]   100
      95     20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
      96     20  63865  rs551938596              G     [A]   100
      97     20  63902  rs571779099              A     [T]   100
      98     20  63963  rs531152674              G     [A]   100
      99     20  63967  rs116770801              A     [G]   100
      100    20  63977  rs199703510              C     [G]   100
      101    20  64016  rs143263863              G     [A]   100
      102    20  64062  rs148297240              G     [A]   100
      103    20  64139  rs186497980              G  [A, T]   100
      104    20  64150    rs7274499              C     [A]   100
      105    20  64151  rs190945171              C     [T]   100
      106    20  64154  rs537656456              T     [G]   100
      107    20  64175  rs116531220              A     [G]   100
      108    20  64186  rs141793347              C     [G]   100
      109    20  64210  rs182418654              G     [C]   100
      110    20  64303  rs559929739              C     [A]   100
      
      1. I think this one fails because pandas broadcasts the comparison: the 2-element list on the right is compared element-wise against the 31 rows of the column.
        >>> mini[mini.ALTS == ["A", "T"]]
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
            res = na_op(values, other)
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
            result = _comp_method_OBJECT_ARRAY(op, x, y)
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
            result = libops.vec_compare(x, y, op)
          File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
        ValueError: Arrays were different lengths: 31 vs 2
        
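      2. I think this fails for a similar reason, but the broadcasting happens in a different place: each cell is a numpy array, so the lambda returns an element-wise boolean array per row instead of a single bool, and the result cannot be used as a mask.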
        >>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
            return self._getitem_array(key)
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
            indexer = self.loc._convert_to_indexer(key, axis=1)
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
            indexer = check = labels.get_indexer(objarr)
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
            indexer = self._engine.get_indexer(target._ndarray_values)
          File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
          File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
        TypeError: unhashable type: 'numpy.ndarray'
        >>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
        80     [True, False]
        81     [True, False]
        82    [False, False]
        83    [False, False]
        84    [False, False]
        
      3. Unfortunately this clever hack fails as well! The comparison collapses to a single False, which pandas then tries to look up as a column label.
        >>> c = np.empty(1, object)
        >>> c[0] = ["A", "T"]
        >>> mini[mini.ALTS.values == c]
        Traceback (most recent call last):
          File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
            return self._engine.get_loc(key)
          File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
          File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
          File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
          File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
        KeyError: False
        >>> mini.ALTS.values == c
        False
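
A possible workaround for attempt 2, sketched under the assumption that each cell is a list or numpy array of strings: wrap both sides in list() so that every row yields exactly one bool. The helper name cell_equals is hypothetical and not part of pandas or pyarrow; the sample rows are made up to mirror the ALTS column.

```python
def cell_equals(cell, wanted):
    """Return a single bool: True when cell holds exactly the wanted items, in order."""
    # list() accepts lists, tuples and numpy arrays alike, and comparing two
    # lists yields one bool rather than an element-wise array.
    return list(cell) == list(wanted)


# Plain-Python stand-in for the ALTS column (illustrative data only).
rows = [["A"], ["GT"], ["A", "T"], ["C"]]
mask = [cell_equals(cell, ["A", "T"]) for cell in rows]
print(mask)  # [False, False, True, False]
```

With a DataFrame, the same helper could presumably be used as mini[mini.ALTS.apply(lambda x: cell_equals(x, ["A", "T"]))], since apply would then return a boolean Series; that usage is not exercised in the sketch above.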
        

      Finally, the following succeeds (probably because a tuple is immutable and is compared against each cell as a single value instead of being broadcast like a list):

      >>> mini["ALTS2"] = mini.ALTS.apply(tuple)
      >>> mini.head()
         CHROM    POS           ID REF  ALTS  QUAL  ALTS2
      80    20  63521  rs191905748   G   [A]   100   (A,)
      81    20  63541  rs117322527   C   [A]   100   (A,)
      82    20  63548  rs541129280   G  [GT]   100  (GT,)
      83    20  63553  rs536661806   T   [C]   100   (C,)
      84    20  63555  rs553463231   T   [C]   100   (C,)
      >>> mini[mini["ALTS2"] == ("A", "T")]
          CHROM    POS           ID REF    ALTS  QUAL   ALTS2
      103    20  64139  rs186497980   G  [A, T]   100  (A, T)
      >>> mini[mini["ALTS2"] == ("GT",)]
         CHROM    POS           ID REF  ALTS  QUAL  ALTS2
      82    20  63548  rs541129280   G  [GT]   100  (GT,)
      >>> mini[mini["ALTS2"] == tuple("C")]
          CHROM    POS           ID            REF ALTS  QUAL ALTS2
      83     20  63553  rs536661806              T  [C]   100  (C,)
      84     20  63555  rs553463231              T  [C]   100  (C,)
      89     20  63698  rs544072005              A  [C]   100  (C,)
      93     20  63808   rs76004960              G  [C]   100  (C,)
      95     20  63857  rs543686274  CCTGGAAAGGATT  [C]   100  (C,)
      109    20  64210  rs182418654              G  [C]   100  (C,)
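
Note that apply(tuple) converts only the outermost level, while the request covers nested types in general. A minimal sketch of a recursive conversion follows, assuming cells are lists or array-likes that expose tolist() (as numpy arrays do). The name to_nested_tuple is hypothetical; it is not an existing pyarrow or pandas function.

```python
def to_nested_tuple(value):
    """Recursively convert lists (and array-likes exposing tolist()) to tuples."""
    # numpy arrays and numpy scalars expose tolist(); normalise them to
    # plain Python objects first. Plain str/int values have no tolist().
    if hasattr(value, "tolist"):
        value = value.tolist()
    if isinstance(value, (list, tuple)):
        return tuple(to_nested_tuple(item) for item in value)
    # Scalars (str, int, None, ...) pass through unchanged.
    return value


nested = [["A", "T"], ["C"]]
converted = to_nested_tuple(nested)
print(converted)        # (('A', 'T'), ('C',))
print(hash(converted))  # hashable, unlike the original nested list
```

Applied per cell, for example via mini.ALTS.apply(to_nested_tuple), this would give hashable values at every nesting depth. When building the value to compare against, keep in mind that tuple("GT") yields ('G', 'T'), so ("GT",) is the safer spelling for multi-character entries; tuple("C") only works above because "C" is a single character.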
      

            People

              Assignee: Unassigned
              Reporter: Suvayu Ali (suvayu)
              Votes: 0
              Watchers: 5
