Sqoop
  1. Sqoop
  2. SQOOP-308

Generated Avro Schema cannot handle nullable fields

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.4.0-incubating
    • Component/s: codegen
    • Labels:
      None

      Description

      The generated avro schema for each column is just a primitive type. If a database contains a null value, this will fail.

      1. SQOOP-308.patch
        5 kB
        Aaron Kimball

        Activity

        Hide
        Aaron Kimball added a comment -

        Instead we should generate a union schema of [T, null], as is done in this patch.

        Updates existing schema-check testcase and adds a new testcase for null-containing records.

        Show
        Aaron Kimball added a comment - Instead we should generate a union schema of [T, null] , as is done in this patch. Updates existing schema-check testcase and adds a new testcase for null-containing records.
        Hide
        Tom White added a comment -

        The change looks good. How about using ResultSetMetaData#isNullable() to test which columns are nullable, and only creating a union schema for those?

        Show
        Tom White added a comment - The change looks good. How about using ResultSetMetaData#isNullable() to test which columns are nullable, and only creating a union schema for those?
        Hide
        Aaron Kimball added a comment -

        That's a far more invasive change, I think. Everywhere else in Sqoop (read: the existing ClassWriter) we assume all types are nullable. This would require modifying the ConnManager API to return nullability for all columns via a new method named something like "getNullabilityForColumns()" or else incompatibly changing getColumnTypesForQuery() to return a complex datatype that includes nullability and "normal" typing information. (And thus putting a lot of changes in orm.ClassWriter too.)

        Show
        Aaron Kimball added a comment - That's a far more invasive change, I think. Everywhere else in Sqoop (read: the existing ClassWriter) we assume all types are nullable. This would require modifying the ConnManager API to return nullability for all columns via a new method named something like "getNullabilityForColumns()" or else incompatibly changing getColumnTypesForQuery() to return a complex datatype that includes nullability and "normal" typing information. (And thus putting a lot of changes in orm.ClassWriter too.)
        Hide
        Aaron Kimball added a comment -

        PS - I think it's perfectly reasonable to do something to improve Sqoop's precision w.r.t. nullability and typing. But I don't think it's part of this issue. I think that this issue ("the Avro output format is currently broken") is separate from "The Avro output format could be more precise."

        It's also worth considering that since clients of these data files will use the Avro generic API (as Sqoop doesn't generate a schema file you can turn into a SpecificRecord), end-users will be dealing with object references returned by GenericRecord.get() anyway (as opposed to primitive types), so providing a more precise schema on the nullability doesn't buy you too much.

        Show
        Aaron Kimball added a comment - PS - I think it's perfectly reasonable to do something to improve Sqoop's precision w.r.t. nullability and typing. But I don't think it's part of this issue. I think that this issue ("the Avro output format is currently broken") is separate from "The Avro output format could be more precise." It's also worth considering that since clients of these data files will use the Avro generic API (as Sqoop doesn't generate a schema file you can turn into a SpecificRecord), end-users will be dealing with object references returned by GenericRecord.get() anyway (as opposed to primitive types), so providing a more precise schema on the nullability doesn't buy you too much.
        Hide
        Tom White added a comment -

        Good point. I agree that checking for nullability doesn't need doing in this JIRA.

        +1 on the patch.

        Show
        Tom White added a comment - Good point. I agree that checking for nullability doesn't need doing in this JIRA. +1 on the patch.
        Hide
        Aaron Kimball added a comment -

        I just committed this to trunk.

        Show
        Aaron Kimball added a comment - I just committed this to trunk.
        Hide
        Hudson added a comment -

        Integrated in Sqoop-jdk-1.6 #5 (See https://builds.apache.org/job/Sqoop-jdk-1.6/5/)
        SQOOP-308. Generated Avro Schema cannot handle nullable fields.

        kimballa : http://svn.apache.org/viewvc/?view=rev&rev=1154059
        Files :

        • /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/TestAvroImport.java
        • /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/orm/AvroSchemaGenerator.java
        Show
        Hudson added a comment - Integrated in Sqoop-jdk-1.6 #5 (See https://builds.apache.org/job/Sqoop-jdk-1.6/5/ ) SQOOP-308 . Generated Avro Schema cannot handle nullable fields. kimballa : http://svn.apache.org/viewvc/?view=rev&rev=1154059 Files : /incubator/sqoop/trunk/src/test/com/cloudera/sqoop/TestAvroImport.java /incubator/sqoop/trunk/src/java/com/cloudera/sqoop/orm/AvroSchemaGenerator.java

          People

          • Assignee:
            Aaron Kimball
            Reporter:
            Aaron Kimball
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development