  1. Spark
  2. SPARK-20451

Filter out nested mapType datatypes from sort order in randomSplit

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 2.1.0, 2.2.0
    • Fix Version/s: 2.0.3, 2.1.1, 2.2.0, 2.3.0
    • Component/s: SQL
    • Labels:
      None

      Description

      In randomSplit, the underlying DataFrame may not guarantee the same ordering of rows within its constituent partitions each time a split is materialized, which can result in overlapping splits.

      To prevent this, we explicitly sort each input partition to make the ordering deterministic. Because MapType columns cannot be sorted, any column whose type is or contains a MapType (for example, a map nested inside a struct or an array) should be explicitly pruned from the sort order. Additionally, if the resulting sort order is empty, we materialize the dataset to guarantee determinism.
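The pruning rule described above can be sketched in plain Python. This is only an illustration of the idea, not Spark's implementation: the classes `Atomic`, `ArrayOf`, `MapOf`, and `StructOf` and the functions `is_orderable`, `sort_order`, and `plan` are hypothetical stand-ins invented for this example, and the sketch assumes that a type is unsortable exactly when a map appears anywhere inside it.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


# Hypothetical toy types standing in for a schema's data types.
@dataclass(frozen=True)
class Atomic:
    name: str


@dataclass(frozen=True)
class ArrayOf:
    element: "DataType"


@dataclass(frozen=True)
class MapOf:
    key: "DataType"
    value: "DataType"


@dataclass(frozen=True)
class StructOf:
    fields: Tuple[Tuple[str, "DataType"], ...]


DataType = Union[Atomic, ArrayOf, MapOf, StructOf]
Schema = List[Tuple[str, DataType]]


def is_orderable(dt: DataType) -> bool:
    """Return False if a map appears anywhere inside the type."""
    if isinstance(dt, MapOf):
        return False
    if isinstance(dt, ArrayOf):
        return is_orderable(dt.element)
    if isinstance(dt, StructOf):
        return all(is_orderable(field_type) for _, field_type in dt.fields)
    return True  # atomic types are assumed sortable in this sketch


def sort_order(schema: Schema) -> List[str]:
    """Keep only the columns that can take part in a per-partition sort."""
    return [name for name, dt in schema if is_orderable(dt)]


def plan(schema: Schema) -> Tuple[str, List[str]]:
    """Sort when possible; otherwise fall back to materializing the data."""
    columns = sort_order(schema)
    if columns:
        return ("sort", columns)
    return ("materialize", [])


string_to_int = MapOf(Atomic("string"), Atomic("int"))
schema = [
    ("id", Atomic("int")),
    ("tags", string_to_int),                       # top-level map: pruned
    ("nested", StructOf((("m", string_to_int),))),  # nested map: pruned
    ("scores", ArrayOf(Atomic("double"))),
]

print(plan(schema))                   # ('sort', ['id', 'scores'])
print(plan([("m", string_to_int)]))   # ('materialize', [])
```

The recursive check matters because a filter that only inspects the top-level type would still let the `nested` column into the sort order.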

            People

            • Assignee:
              sameerag Sameer Agarwal
              Reporter:
              sameerag Sameer Agarwal
