Spark / SPARK-35683

Fix Index.difference to avoid collecting 'other' to the driver side

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: PySpark
    • Labels: None

    Description

      See:
      https://github.com/databricks/koalas/pull/1325#discussion_r647889901
      https://github.com/databricks/koalas/pull/1325#discussion_r647890007

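      import pyspark.pandas as ps

      midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
      # idx1 is not defined in this report; presumably another pandas-on-Spark Index
      midx1.difference(idx1)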
      
      pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
      

      In addition, calling MultiIndex.from_tuples collects all of the data to the driver side.


            People

              itholic Haejoon Lee
              gurwls223 Hyukjin Kwon