SPARK-21330

Bad partitioning does not allow reading a JDBC table with extreme values on the partition column

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.1.2, 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels: None

      Description

      When using "extreme" values in the partition column (such as randomly generated Long values), the partition-count computation can overflow, leading to the following warning message:

      WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: -1559072469251914524; Input number of partitions: 20; Lower bound: -7701345953623242445; Upper bound: 9186325650834394647.

      When this happens, no data is read from the table.
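
      For reference, this can be reproduced with any partitioned JDBC read whose bounds span nearly the full Long range. A minimal sketch (the JDBC URL, table, and column names are hypothetical; the bounds are the ones from the warning above; spark is an existing SparkSession):

          // Partitioned JDBC read; Spark splits the id range into numPartitions slices.
          val df = spark.read
            .format("jdbc")
            .option("url", "jdbc:postgresql://localhost/test")  // hypothetical
            .option("dbtable", "events")                        // hypothetical
            .option("partitionColumn", "id")
            .option("lowerBound", "-7701345953623242445")
            .option("upperBound", "9186325650834394647")
            .option("numPartitions", "20")
            .load()

          df.count()  // 0 rows: the overflowed partition count yields no partitions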

      This happens because upperBound - lowerBound can overflow a 64-bit Long in the following check in org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala:

      if ((upperBound - lowerBound) >= partitioning.numPartitions)

      The funny thing is that we worry about overflows only a few lines later:

          // Overflow and silliness can happen if you subtract then divide.
          // Here we get a little roundoff, but that's (hopefully) OK.
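
      To see the overflow concretely, plug the numbers from the warning into the check (a standalone sketch, runnable in the Scala REPL):

          val lowerBound = -7701345953623242445L  // from the warning message
          val upperBound = 9186325650834394647L
          val numPartitions = 20

          // The true difference (~1.7e19) exceeds Long.MaxValue (~9.2e18),
          // so the subtraction wraps around to a negative number.
          val diff = upperBound - lowerBound
          println(diff)                   // -1559072469251914524
          println(diff >= numPartitions)  // false: the partition count is then
                                          // set to the negative diff, and no
                                          // partitions (hence no rows) are produced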

      A better check would be:

      if ((upperBound - partitioning.numPartitions) >= lowerBound)
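
      With the same numbers, the rearranged check behaves correctly, because it subtracts only the small partition count from one bound instead of subtracting the two far-apart bounds from each other (a sketch; it assumes upperBound is not within numPartitions of Long.MinValue):

          val lowerBound = -7701345953623242445L
          val upperBound = 9186325650834394647L
          val numPartitions = 20

          // upperBound - numPartitions cannot wrap for these inputs, so the
          // comparison stays in range and the requested 20 partitions are kept.
          println((upperBound - numPartitions) >= lowerBound)  // true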


    People

    • Assignee: Andrew Ray (a1ray)
    • Reporter: Stefano Parmesan (parmesan)
