[SPARK-35683] Fix Index.difference to avoid collect 'other' to driver side - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.2.0
Fix Version/s: 3.2.0
Component/s: PySpark
Labels: None

Description

See:
https://github.com/databricks/koalas/pull/1325#discussion_r647889901
https://github.com/databricks/koalas/pull/1325#discussion_r647890007

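import pyspark.pandas as ps

midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
# idx1 is not defined in this report; see the linked discussion for context.
midx1.difference(idx1)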

pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.

In addition, calling MultiIndex.from_tuples collects all of the data to the driver side.
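The concern above, avoiding a collect of 'other' to the driver, can be illustrated with a minimal sketch. This is not the change made in the linked pull request; it only shows one way a set difference can stay distributed, assuming both indexes are available as Spark DataFrames whose columns hold the index levels. The function name `difference_without_collect` and its parameters are hypothetical.

```python
def difference_without_collect(left_sdf, right_sdf, cols):
    """Return rows of left_sdf[cols] that do not appear in right_sdf[cols].

    Hypothetical helper for illustration only. Both arguments are assumed
    to be pyspark.sql.DataFrame objects, and `cols` names the columns that
    hold the index levels.
    """
    cols = list(cols)
    # De-duplicate both sides, since an index difference yields unique values.
    left_keys = left_sdf.select(*cols).distinct()
    right_keys = right_sdf.select(*cols).distinct()
    # A left anti join keeps only the left rows with no match on the right.
    # It runs on the executors, so neither side is collected to the driver.
    # Note: null keys never match each other in a Spark equi-join.
    remaining = left_keys.join(right_keys, on=cols, how="left_anti")
    # Sort so the result has a deterministic order.
    return remaining.orderBy(*cols)
```

Given two Spark DataFrames `left` and `right` with columns `a`, `b` and `c`, calling `difference_without_collect(left, right, ["a", "b", "c"])` returns another Spark DataFrame, so nothing is materialized on the driver until the caller explicitly asks for it.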

Issue Links

relates to

SPARK-35682 Pin mypy version in GitHub Actions CI (Resolved)

SPARK-35684 Bump up mypy version in GitHub Actions (Resolved)

links to

[GitHub] Pull Request #32853 (itholic)

People

Assignee: Haejoon Lee

Reporter: Hyukjin Kwon

Votes: 0

Watchers: 2

Dates

Created: 09/Jun/21 01:06

Updated: 12/Dec/22 18:11

Resolved: 15/Jun/21 05:19