  1. Apache Gobblin
  2. GOBBLIN-571

JsonIntermediateToParquetGroupConverter generates wrong parquet schema for complex types such as enums, arrays and maps

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 0.15.0
    • Component/s: None
    • Labels:
      None

      Description

      JsonIntermediateToParquetGroupConverter generates a wrong Parquet schema for complex types such as arrays, maps and enums. For these types, the OPTIONAL/REQUIRED attribute of the SchemaField is set incorrectly.

       

      As a result, Spark throws the following error when reading Parquet files generated with JsonIntermediateToParquetGroupConverter:

      Caused by: parquet.io.ParquetDecodingException: Can not read value at 0 

      An example of a wrongly generated schema is shown below. Note that the field payload.action is marked as required:

      message EventData {
        optional int64 id;
        optional binary type (UTF8);
        required group actor {
          optional int64 id;
          optional binary login (UTF8);
          optional binary gravatar_id (UTF8);
          optional binary url (UTF8);
          optional binary avatar_url (UTF8);
        }
        required group repo {
          optional int64 id;
          optional binary name (UTF8);
          optional binary url (UTF8);
          optional binary urlid (UTF8);
        }
        required group payload {
          optional int64 id;
          optional binary ref (UTF8);
          optional binary ref_type (UTF8);
          optional binary master_branch (UTF8);
          optional binary description (UTF8);
          optional binary pusher_type (UTF8);
          optional binary before (UTF8);
          required binary action (UTF8);
        }
        optional boolean public;
        optional binary created_at (UTF8);
        optional binary created_at_id (UTF8);
      }
      

      However, the field payload.action is declared with isNullable: true in the source.schema property:

      [ ....
          {
          "columnName": "payload",
          "dataType": {
            "type": "record",
            "name": "payloadDetails",
            "values": [
              ....
              {
                "columnName": "action",
                "isNullable": true,
                "dataType": {
                  "type": "enum",
                  "name": "actionType",
                  "symbols": [
                    "started",
                    "published",
                    "opened",
                    "closed",
                    "created",
                    "reopened",
                    "added"
                  ]
                }
              }
            ]
          }
        }....
      ]
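
      To make the expected behaviour concrete, here is a minimal, self-contained sketch of the rule the report implies: the repetition should follow isNullable regardless of the column's data type. This is not the converter's actual code. The class RepetitionSketch, the Column holder and the Repetition enum are hypothetical stand-ins introduced only for illustration, and no Gobblin or Parquet classes are used.

```java
import java.util.Locale;

public class RepetitionSketch {

    // Hypothetical stand-in for the two repetition levels seen in the schema above.
    public enum Repetition { REQUIRED, OPTIONAL }

    // Hypothetical, simplified view of one entry of the source.schema property.
    public static final class Column {
        final String name;
        final String type;
        final boolean isNullable;

        public Column(String name, String type, boolean isNullable) {
            this.name = name;
            this.type = type;
            this.isNullable = isNullable;
        }
    }

    // Assumed rule: nullability alone decides the repetition;
    // the data type (enum, array, map, ...) must not change it.
    public static Repetition repetitionFor(Column column) {
        return column.isNullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Renders the repetition and field name in the lowercase style of the schema dump.
    public static String describe(Column column) {
        return repetitionFor(column).name().toLowerCase(Locale.ROOT) + " " + column.name;
    }

    public static void main(String[] args) {
        // The enum column from the report: isNullable is true, so it should be optional.
        Column action = new Column("action", "enum", true);
        System.out.println(describe(action));
    }
}
```

      Under this assumption, the action column from the report would be emitted as optional rather than required.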
      
      


            People

            • Assignee:
              Unassigned
            • Reporter:
              tilakpatidar Tilak Patidar