  Spark / SPARK-21657

Spark has exponential time complexity to explode(array of structs)


Details

    Description

      It can take up to half a day to explode a modest-sized nested collection (~0.5M records) on a recent Xeon processor.

      See the attached PySpark script that reproduces this problem.

      cached_df = sqlc.sql('select individ, hholdid, explode(amft) from ' + table_name).cache()
      print cached_df.count()
      

      This script generates a number of tables that all hold the same total number of records across their nested collections (see the `scaling` variable in the loops). The `scaling` variable scales up the number of nested elements per record and scales down the number of records in the table by the same factor, so the total number of records stays constant.
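
      For reference, a minimal sketch of how such tables could be generated, assuming the SQLContext `sqlc` from the snippet above. The column names (individ, hholdid, amft) come from that snippet; the struct fields, scaling values and table names here are illustrative assumptions, not the attached script.

      from pyspark.sql import Row

      total_records = 500000                         # total nested records held constant
      for scaling in [1, 10, 100, 1000, 10000, 50000]:
          n_rows = max(1, total_records // scaling)  # fewer rows as nested size grows
          rows = [
              Row(individ=i,
                  hholdid=i // 10,
                  # each record carries `scaling` nested structs
                  amft=[Row(amt=float(j), typ='x') for j in range(scaling)])
              for i in range(n_rows)
          ]
          df = sqlc.createDataFrame(rows)
          df.registerTempTable('nested_%d' % scaling)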

      Time grows exponentially (see the attached chart; note the log-10 scale on the vertical axis).

      At a scaling factor of 50,000 (see the attached PySpark script), it took 7 hours (!) to explode the nested collections of 8k records.

      Beyond 1,000 elements per nested collection, time grows exponentially.
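
      A hedged sketch of how the per-scaling timings could be collected, reusing the tables registered in the sketch above; the loop, table names and output format are assumptions, not the attached script.

      import time

      for scaling in [1, 10, 100, 1000, 10000, 50000]:
          table_name = 'nested_%d' % scaling
          start = time.time()
          # explode the nested collection and force full materialization via count()
          exploded = sqlc.sql(
              'select individ, hholdid, explode(amft) from ' + table_name).cache()
          n = exploded.count()
          print '%6d elems/record: %d exploded rows in %.1f s' % (
              scaling, n, time.time() - start)
          exploded.unpersist()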

      Attachments

        Issue Links

        Activity


          People

            Assignee: Ohad Raviv (uzadude)
            Reporter: Ruslan Dautkhanov (Tagar)
            Votes: 4
            Watchers: 18

            Dates

              Created:
              Updated:
              Resolved:
