Spark / SPARK-17195

Dealing with JDBC column nullability when it is not reliable


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      Starting with Spark 2.0.0, a column's "nullable" property must be correct for code generation to work properly. Marking a column as nullable = false used to (< 2.0.0) still allow null values to be operated on, but now it results in:

      Caused by: java.lang.NullPointerException
              at org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter.write(UnsafeRowWriter.java:210)
              at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
              at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
              at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
              at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
      

      I'm all for the change towards more rigid behavior (enforcing correct input). But the problem I'm facing now is that when I use JDBC to read from a Teradata server, the column nullability is often incorrect (particularly when sub-queries are involved).

      This is the line in question:
      https://github.com/apache/spark/blob/v2.0.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L140
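
      For context, the logic at that line boils down to asking the driver's ResultSetMetaData whether the column can hold nulls. The sketch below is a Java paraphrase of that check (not the actual Spark source, which is Scala); the Proxy-based fake metadata object is demonstration-only, standing in for what a real driver would supply:

      ```java
      import java.lang.reflect.Proxy;
      import java.sql.ResultSetMetaData;

      public class SparkStyleNullability {
          // Rough paraphrase of what JDBCRDD does per column: anything other
          // than an explicit columnNoNulls answer is treated as nullable.
          static boolean sparkNullable(ResultSetMetaData rsmd, int col) throws Exception {
              return rsmd.isNullable(col) != ResultSetMetaData.columnNoNulls;
          }

          // Demonstration-only fake; a real JDBC driver supplies this object.
          static ResultSetMetaData fakeIsNullable(int answer) {
              return (ResultSetMetaData) Proxy.newProxyInstance(
                      ResultSetMetaData.class.getClassLoader(),
                      new Class<?>[]{ResultSetMetaData.class},
                      (p, m, a) -> {
                          if (m.getName().equals("isNullable")) return answer;
                          throw new UnsupportedOperationException(m.getName());
                      });
          }

          public static void main(String[] args) throws Exception {
              // A driver that wrongly reports columnNoNulls (as Teradata can for
              // sub-queries) makes Spark mark the field non-nullable, so any
              // actual null later blows up in the generated code.
              System.out.println(sparkNullable(fakeIsNullable(ResultSetMetaData.columnNoNulls)));         // false
              System.out.println(sparkNullable(fakeIsNullable(ResultSetMetaData.columnNullableUnknown))); // true
          }
      }
      ```

      So Spark does the safe thing for the "unknown" answer already; the failure mode here is a driver that confidently answers columnNoNulls when it shouldn't.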

      I'm trying to work out the way forward for me on this. I know that it's really the fault of the Teradata database server for not returning the correct schema, but I'll need to make Spark itself or my application resilient to this behavior.

      One of the Teradata JDBC Driver tech leads has told me that "when the rsmd.getSchemaName and rsmd.getTableName methods return an empty zero-length string, then the other metadata values may not be completely accurate" - so one option could be to treat the nullability (at least) the same way as the "unknown" case (as nullable = true). For reference, see the rest of our discussion here: http://forums.teradata.com/forum/connectivity/teradata-jdbc-driver-returns-the-wrong-schema-column-nullability
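
      That suggestion could be sketched as a helper like the hypothetical effectiveNullable below (my own naming, not a proposed Spark API): fall back to nullable = true whenever the schema and table names come back empty, or the driver itself answers "unknown". The Proxy-based fake is again demonstration-only:

      ```java
      import java.lang.reflect.Proxy;
      import java.sql.ResultSetMetaData;

      public class NullabilityHeuristic {
          // Hypothetical helper: decide whether to treat a JDBC column as
          // nullable, distrusting the driver when its metadata looks suspect
          // (empty schema/table names) or when it reports "unknown".
          static boolean effectiveNullable(ResultSetMetaData rsmd, int col) throws Exception {
              boolean metadataSuspect = rsmd.getSchemaName(col).isEmpty()
                      && rsmd.getTableName(col).isEmpty();
              int n = rsmd.isNullable(col);
              if (metadataSuspect || n == ResultSetMetaData.columnNullableUnknown) {
                  return true;  // play it safe: assume nulls can occur
              }
              return n != ResultSetMetaData.columnNoNulls;
          }

          // Demonstration-only fake metadata; a real driver supplies this object.
          static ResultSetMetaData fake(String schema, String table, int nullable) {
              return (ResultSetMetaData) Proxy.newProxyInstance(
                      ResultSetMetaData.class.getClassLoader(),
                      new Class<?>[]{ResultSetMetaData.class},
                      (proxy, method, args) -> {
                          switch (method.getName()) {
                              case "getSchemaName": return schema;
                              case "getTableName":  return table;
                              case "isNullable":    return nullable;
                              default: throw new UnsupportedOperationException(method.getName());
                          }
                      });
          }

          public static void main(String[] args) throws Exception {
              // Teradata sub-query case: empty schema/table names and a bogus
              // columnNoNulls answer -- the heuristic overrides it to nullable.
              System.out.println(effectiveNullable(fake("", "", ResultSetMetaData.columnNoNulls), 1));   // true
              // Trustworthy-looking metadata: respect what the driver says.
              System.out.println(effectiveNullable(fake("db", "t", ResultSetMetaData.columnNoNulls), 1)); // false
          }
      }
      ```

      The cost of this heuristic is that genuinely non-nullable columns from such sources lose their nullable = false marking, which only pessimizes the generated code rather than breaking it.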

      Any other thoughts?


People

    Assignee: Unassigned
    Reporter: Jason Moore (jasonmoore2k)
    Votes: 0
    Watchers: 3
