Details
- Improvement
- Status: Open
- Minor
- Resolution: Unresolved
- 0.11.1
- None
- Fedora 29, pyarrow installed with conda
Description
When converting to pandas, convert nested types (e.g. list) to tuples. Columns with lists are difficult to query. Here are a few unsuccessful attempts:
>>> mini
     CHROM    POS           ID            REF    ALTS  QUAL
 80     20  63521  rs191905748              G     [A]   100
 81     20  63541  rs117322527              C     [A]   100
 82     20  63548  rs541129280              G    [GT]   100
 83     20  63553  rs536661806              T     [C]   100
 84     20  63555  rs553463231              T     [C]   100
 85     20  63559  rs138359120              C     [A]   100
 86     20  63586  rs545178789              T     [G]   100
 87     20  63636  rs374311122              G     [A]   100
 88     20  63696  rs149160003              A     [G]   100
 89     20  63698  rs544072005              A     [C]   100
 90     20  63729  rs181483669              G     [A]   100
 91     20  63733   rs75670495              C     [T]   100
 92     20  63799    rs1418258              C     [T]   100
 93     20  63808   rs76004960              G     [C]   100
 94     20  63813  rs532151719              G     [A]   100
 95     20  63857  rs543686274  CCTGGAAAGGATT     [C]   100
 96     20  63865  rs551938596              G     [A]   100
 97     20  63902  rs571779099              A     [T]   100
 98     20  63963  rs531152674              G     [A]   100
 99     20  63967  rs116770801              A     [G]   100
100     20  63977  rs199703510              C     [G]   100
101     20  64016  rs143263863              G     [A]   100
102     20  64062  rs148297240              G     [A]   100
103     20  64139  rs186497980              G  [A, T]   100
104     20  64150    rs7274499              C     [A]   100
105     20  64151  rs190945171              C     [T]   100
106     20  64154  rs537656456              T     [G]   100
107     20  64175  rs116531220              A     [G]   100
108     20  64186  rs141793347              C     [G]   100
109     20  64210  rs182418654              G     [C]   100
110     20  64303  rs559929739              C     [A]   100
- I think this one fails because it tries to broadcast the comparison.
>>> mini[mini.ALTS == ["A", "T"]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1283, in wrapper
    res = na_op(values, other)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1143, in na_op
    result = _comp_method_OBJECT_ARRAY(op, x, y)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/ops.py", line 1120, in _comp_method_OBJECT_ARRAY
    result = libops.vec_compare(x, y, op)
  File "pandas/_libs/ops.pyx", line 128, in pandas._libs.ops.vec_compare
ValueError: Arrays were different lengths: 31 vs 2
- I think this fails for a similar reason, but the broadcasting happens in a different place.
>>> mini[mini.ALTS.apply(lambda x: x == ["A", "T"])]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2682, in __getitem__
    return self._getitem_array(key)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
    indexer = self.loc._convert_to_indexer(key, axis=1)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexing.py", line 1314, in _convert_to_indexer
    indexer = check = labels.get_indexer(objarr)
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3259, in get_indexer
    indexer = self._engine.get_indexer(target._ndarray_values)
  File "pandas/_libs/index.pyx", line 301, in pandas._libs.index.IndexEngine.get_indexer
  File "pandas/_libs/hashtable_class_helper.pxi", line 1544, in pandas._libs.hashtable.PyObjectHashTable.lookup
TypeError: unhashable type: 'numpy.ndarray'
>>> mini.ALTS.apply(lambda x: x == ["A", "T"]).head()
80     [True, False]
81     [True, False]
82    [False, False]
83    [False, False]
84    [False, False]
- Unfortunately this clever hack fails as well!
>>> c = np.empty(1, object)
>>> c[0] = ["A", "T"]
>>> mini[mini.ALTS.values == c]
Traceback (most recent call last):
  File "/home/user/miniconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: False
>>> mini.ALTS.values == c
False
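The per-row `[True, False]` values from the second attempt can be reproduced in isolation. This is a minimal sketch, assuming (as the `unhashable type: 'numpy.ndarray'` error suggests) that each ALTS cell is a NumPy object array rather than a Python list; the names `cell`, `elementwise` and `as_scalar` are illustrative only.

```python
import numpy as np

# Assumption: a converted list cell looks like a 1-D object array.
cell = np.array(["A"], dtype=object)

# Comparing an array with a list broadcasts element-wise, so the
# result is an array of booleans rather than a single True/False.
elementwise = cell == ["A", "T"]
print(elementwise)

# Converting the cell to a tuple first gives an ordinary Python
# comparison that yields one bool per row.
as_scalar = tuple(cell) == ("A", "T")
print(as_scalar)
```

Under that assumption, `apply` returns a Series of arrays instead of a Series of booleans, which pandas then tries to use as column labels.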
Finally, what succeeds is the following (probably because tuples are immutable):
>>> mini["ALTS2"] = mini.ALTS.apply(tuple)
>>> mini.head()
    CHROM    POS           ID  REF  ALTS  QUAL  ALTS2
80     20  63521  rs191905748    G   [A]   100   (A,)
81     20  63541  rs117322527    C   [A]   100   (A,)
82     20  63548  rs541129280    G  [GT]   100  (GT,)
83     20  63553  rs536661806    T   [C]   100   (C,)
84     20  63555  rs553463231    T   [C]   100   (C,)
>>> mini[mini["ALTS2"] == ("A", "T")]
     CHROM    POS           ID  REF    ALTS  QUAL   ALTS2
103     20  64139  rs186497980    G  [A, T]   100  (A, T)
>>> mini[mini["ALTS2"] == ("GT",)]
    CHROM    POS           ID  REF  ALTS  QUAL  ALTS2
82     20  63548  rs541129280    G  [GT]   100  (GT,)
>>> mini[mini["ALTS2"] == tuple("C")]
     CHROM    POS           ID            REF  ALTS  QUAL  ALTS2
 83     20  63553  rs536661806              T   [C]   100   (C,)
 84     20  63555  rs553463231              T   [C]   100   (C,)
 89     20  63698  rs544072005              A   [C]   100   (C,)
 93     20  63808   rs76004960              G   [C]   100   (C,)
 95     20  63857  rs543686274  CCTGGAAAGGATT   [C]   100   (C,)
109     20  64210  rs182418654              G   [C]   100   (C,)
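The same tuple conversion can also be done inside the filter, without adding a helper column. The following is a sketch under assumptions, not a statement of how pyarrow or pandas should behave: the three-row frame is a cut-down stand-in for `mini`, and `as_object_column` is a hypothetical helper that mimics cells stored as NumPy object arrays.

```python
import numpy as np
import pandas as pd


def as_object_column(rows):
    """Hypothetical helper: build a 1-D object array whose cells are NumPy arrays."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = np.array(row, dtype=object)
    return out


# Small stand-in for the `mini` frame shown above.
df = pd.DataFrame({
    "POS": [63521, 63548, 64139],
    "ID": ["rs191905748", "rs541129280", "rs186497980"],
    "ALTS": as_object_column([["A"], ["GT"], ["A", "T"]]),
})

# Convert each cell to a tuple inside the lambda so every row yields
# a single bool; the result is a boolean mask usable for indexing.
mask = df["ALTS"].apply(lambda alts: tuple(alts) == ("A", "T"))
print(df[mask])
```

This keeps the original column untouched, at the cost of a Python-level function call per row.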
Issue Links
- relates to ARROW-5287 [Python] automatic type inference for arrays of tuples (Open)