SPARK-21330: Bad partitioning does not allow reading a JDBC table with extreme values on the partition column


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.1.2, 2.2.1, 2.3.0
    • Component/s: SQL
    • Labels: None

    Description

      When using "extreme" values in the partition column (such as randomly generated Long values), an overflow can occur, leading to the following warning message:

      WARN JDBCRelation: The number of partitions is reduced because the specified number of partitions is less than the difference between upper bound and lower bound. Updated number of partitions: -1559072469251914524; Input number of partitions: 20; Lower bound: -7701345953623242445; Upper bound: 9186325650834394647.

      When this happens, no data is read from the table.

      This happens because of the following check in org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala:

      if ((upperBound - lowerBound) >= partitioning.numPartitions)
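
      For illustration, plain Scala arithmetic (not Spark code) with the bounds taken from the warning above shows the wrap-around; Long subtraction overflows and lands exactly on the negative partition count reported in the log:

          // Plain Scala: Long subtraction wraps modulo 2^64 on overflow.
          val lowerBound = -7701345953623242445L
          val upperBound = 9186325650834394647L
          // Wraps past Long.MaxValue to the value seen in the warning.
          println(upperBound - lowerBound) // -1559072469251914524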

      The funny thing is that the code worries about overflow just a few lines later:

          // Overflow and silliness can happen if you subtract then divide.
          // Here we get a little roundoff, but that's (hopefully) OK.

      A better check would be:

      if ((upperBound - partitioning.numPartitions) >= lowerBound)
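
      A minimal sketch (again plain Scala, reusing the bounds from the warning above) contrasting the two checks:

          val lowerBound = -7701345953623242445L
          val upperBound = 9186325650834394647L
          val numPartitions = 20L

          // Original check: the subtraction overflows to a negative value,
          // so the comparison is false and the overflowed difference is
          // used as the partition count.
          println((upperBound - lowerBound) >= numPartitions) // false

          // Proposed check: subtracting the small partition count first
          // stays within Long range, so the comparison is true and the
          // requested 20 partitions are kept.
          println((upperBound - numPartitions) >= lowerBound) // true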

    Attachments

    Activity

    People

        Assignee: Andrew Ray (a1ray)
        Reporter: Stefano Parmesan (parmesan)
        Votes: 0
        Watchers: 2

    Dates

        Created:
        Updated:
        Resolved: