Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-42406

[PROTOBUF] Recursive field handling is incompatible with delta

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 3.4.0
    • 3.4.0
    • Protobuf
    • None

    Description

      Protobuf deserializer (`from_protobuf()` function()) optionally supports recursive fields by limiting the depth to certain level. See example below. It assigns a 'NullType' for such a field when allowed depth is reached. 

      It causes a few issues. E.g. a repeated field as in the following example results in a Array field with 'NullType'. Delta does not support null type in a complex type.

      Actually `Array[NullType]` is not really useful anyway.

      How about this fix: Drop the recursive field when the limit reached rather than using a NullType. 

      The example below makes it clear:

      Consider a recursive Protobuf:

       

      message TreeNode {
        string value = 1;
        repeated TreeNode children = 2;
      }
      

      Allow depth of 2: 

       

         df.select(
          'proto',
           messageName = 'TreeNode',
           options = { ... "recursive.fields.max.depth" : "2" }
        ).printSchema()
      

      Schema looks like this:

      root
      |– from_protobuf(proto): struct (nullable = true)|
      | |– value: string (nullable = true)|
      | |– children: array (nullable = false)|
      | | |– element: struct (containsNull = false)|
      | | | |– value: string (nullable = true)|
      | | | |– children: array (nullable = false)|
      | | | | |– element: struct (containsNull = false)|
      | | | | | |– value: string (nullable = true)|
      | | | | | |– children: array (nullable = false). [ === Proposed fix: Drop this field === ]|
      | | | | | | |– element: void (containsNull = false) [ === NOTICE 'void' HERE === ] 
      

      When we try to write this to a delta table, we get an error:

      AnalysisException: Found nested NullType in column from_protobuf(proto).children which is of ArrayType. Delta doesn't support writing NullType in complex types.
      

       
      We could just drop the field 'element' when recursion depth is reached. It is simpler and does not need to deal with NullType. We are ignoring the value anyway. There is no use in keeping the field.

      Another issue is setting for 'recursive.fields.max.depth': It is not enforced correctly. '0' does not make sense. 

       

      Attachments

        Activity

          People

            rangadi Raghu Angadi
            rangadi Raghu Angadi
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: