Spark / SPARK-17788

RangePartitioner results in few very large tasks and many small to empty tasks


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.3.0
    • Component/s: Spark Core, SQL
    • Labels: None
    • Environment: Ubuntu 14.04 64bit, Java 1.8.0_101

    Description

      Greetings everyone,

      I was trying to read a single field of a Hive table stored as Parquet in Spark (~140 GB for the entire table; this single field is a Double, ~1.4B records) and look at the sorted output using the following:
      sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
      But this simple line of code gives:
      Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with more than 17179869176 bytes

      Same error for:
      sql("SELECT " + field + " FROM MY_TABLE").sort(field)
      and:
      sql("SELECT " + field + " FROM MY_TABLE").orderBy(field)
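
For scale, the byte count in the error message is just under 16 GiB. It appears to match a limit of (2^31 - 1) eight-byte words, which as far as I can tell is the maximum page size Spark's memory manager allows, so a single task was asking for more than that in one allocation. A quick arithmetic check in plain Python (the variable names are only for illustration):

```python
# Byte count quoted in the IllegalArgumentException above.
reported_limit = 17179869176

# Assumption: the limit is (2**31 - 1) words of 8 bytes each.
max_words = 2**31 - 1
derived_limit = max_words * 8

# Compare the derived value with the one in the error message.
print(derived_limit == reported_limit)

# Express the limit in GiB (2**30 bytes) to get a feel for the size.
print(reported_limit / 2**30)
```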

      After doing some searching, the issue seems to lie in the RangePartitioner trying to create equal ranges. [1]

      [1] https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html

      The Double values I'm trying to sort are mostly in the range [0,1] (~70% of the data, which is roughly 1 billion records); other values in the dataset are as high as 2000. With the RangePartitioner trying to create equal ranges, some tasks end up almost empty while others are extremely large, due to the heavily skewed distribution.
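
To make the skew concrete, here is a small plain-Python sketch, not Spark's RangePartitioner code. It builds a synthetic dataset shaped like the one described (about 70% of values in [0,1], the rest up to 2000) and compares two ways of choosing range boundaries: equal-width ranges over the value domain, and boundaries taken from a sorted random sample (the linked API docs say the ranges are determined by sampling). All function names, the dataset size, the partition count, and the sample size are assumptions chosen for illustration.

```python
import bisect
import random


def partition_counts(values, bounds):
    """Count how many values fall into each range given sorted upper bounds."""
    counts = [0] * (len(bounds) + 1)
    for v in values:
        counts[bisect.bisect_left(bounds, v)] += 1
    return counts


def equal_width_bounds(lo, hi, num_partitions):
    """Split [lo, hi] into ranges of equal width, ignoring the data distribution."""
    step = (hi - lo) / num_partitions
    return [lo + step * i for i in range(1, num_partitions)]


def sampled_quantile_bounds(values, num_partitions, sample_size, rng):
    """Pick boundaries at evenly spaced positions of a sorted random sample."""
    sample = sorted(rng.sample(values, sample_size))
    return [sample[(i * sample_size) // num_partitions]
            for i in range(1, num_partitions)]


rng = random.Random(42)
n = 100_000  # hypothetical size, far smaller than the 1.4B records in the report

# Roughly 70% of values in [0, 1], the remainder spread up to 2000.
values = [rng.uniform(0, 1) if rng.random() < 0.7 else rng.uniform(1, 2000)
          for _ in range(n)]

num_partitions = 10
width_counts = partition_counts(
    values, equal_width_bounds(0.0, 2000.0, num_partitions))
quantile_counts = partition_counts(
    values, sampled_quantile_bounds(values, num_partitions, 2000, rng))

print("equal-width ranges:  ", width_counts)
print("sample-based ranges: ", quantile_counts)
```

With equal-width ranges, the first range absorbs nearly all of the [0,1] values and the rest stay small, which is the pattern of one huge task and many near-empty ones. Sample-based boundaries balance the counts only as well as the sample represents the data, so a sample that is too small relative to the skew could still leave some ranges oversized.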

      This is either a bug in Apache Spark or a major limitation of the framework. I hope one of the devs can help solve this issue.

      P.S. Email thread on Spark user mailing list:
      http://mail-archives.apache.org/mod_mbox/spark-user/201610.mbox/%3CCA%2B_of14hTVYTUHXC%3DmS9Kqd6qegVvkoF-ry3Yj2%2BRT%2BWSBNzhg%40mail.gmail.com%3E


          People

            Assignee: Wenchen Fan (cloud_fan)
            Reporter: Babak Alipour (babak.alipour@gmail.com)
            Votes: 0
            Watchers: 9

