  1. Spark
  2. SPARK-38245

Avro complex union type returns `member$i`


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.2.1
    • Fix Version/s: None
    • Component/s: SQL

    Description

      Short Description

      When reading complex union types from Avro files, information is lost: the name of each record in the union is omitted and member$i is returned instead.

      Long Description

      Error

      Given the Avro schema schema.avsc, I expected the schema printed by read_avro.py to match expected.txt. Instead, I get the schema in reality.txt, where RecordOne became member0, RecordTwo became member1, and so on.

      This loses the record names and makes the DataFrame unusable.

      From my understanding this behavior was implemented here.
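
      The naming rule described above can be modelled in a few lines of Python. This is only a sketch of my understanding of the reader's behavior, not the actual Spark implementation. It assumes that null branches are dropped first, that a union left with a single branch is read as that branch, and that the remaining branches are numbered in declaration order. Numeric special cases (such as a union of int and long) are not modelled. The function name union_member_names is hypothetical, and the schema is an abridged copy of schema.avsc.

```python
import json


def union_member_names(avro_schema: dict) -> dict:
    """Map each top-level union field to {"memberN": original branch name}."""
    result = {}
    for field in avro_schema["fields"]:
        field_type = field["type"]
        if not isinstance(field_type, list):
            continue  # not a union
        # Assumption: "null" branches are removed before the rest are numbered.
        branches = [b for b in field_type if b != "null"]
        if len(branches) < 2:
            continue  # e.g. ["null", "string"] is read as a plain nullable string
        names = {}
        for i, branch in enumerate(branches):
            # Named types (record, enum, fixed) carry a "name"; primitives are plain strings.
            label = branch.get("name", branch["type"]) if isinstance(branch, dict) else branch
            names[f"member{i}"] = label
        result[field["name"]] = names
    return result


# Abridged version of schema.avsc, keeping only what the sketch needs.
SCHEMA = json.loads("""
{
  "type": "record", "name": "SomeData", "namespace": "my.name.space",
  "fields": [
    {"name": "field_id", "type": ["null", "string"], "default": null},
    {"name": "values", "type": [
      {"type": "record", "name": "RecordOne",
       "fields": [{"name": "field_a", "type": "long"}]},
      {"type": "record", "name": "RecordTwo",
       "fields": [{"name": "field_a", "type": "long"}]}
    ]}
  ]
}
""")

print(union_member_names(SCHEMA))
# {'values': {'member0': 'RecordOne', 'member1': 'RecordTwo'}}
```

      The mapping shows exactly which names are dropped: member0 stands for RecordOne and member1 for RecordTwo, while field_id is unaffected because its union has only one non-null branch.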

       

      read_avro.py
      from pyspark.sql import SparkSession
      spark = SparkSession.builder.getOrCreate()
      df = spark.read.format("avro").load("path/to/my/file.avro")
      df.printSchema()
       
      schema.avsc
       {
       "type": "record",
       "name": "SomeData",
       "namespace": "my.name.space",
       "fields": [
        {
         "name": "ts",
         "type": {
          "type": "long",
          "logicalType": "timestamp-millis"
         }
        },
        {
         "name": "field_id",
         "type": [
          "null",
          "string"
         ],
         "default": null
        },
        {
         "name": "values",
         "type": [
          {
           "type": "record",
           "name": "RecordOne",
           "fields": [
            {
             "name": "field_a",
             "type": "long"
            },
            {
             "name": "field_b",
             "type": {
              "type": "enum",
              "name": "FieldB",
              "symbols": [
                  "..."
              ]
             }
            },
            {
             "name": "field_c",
             "type": {
              "type": "array",
              "items": "long"
             }
            }
           ]
          },
          {
           "type": "record",
           "name": "RecordTwo",
           "fields": [
            {
             "name": "field_a",
             "type": "long"
            }
           ]
          }
         ]
        }
       ]
      }
      expected.txt
      root
       |-- ts: timestamp (nullable = true)
       |-- field_id: string (nullable = true)
       |-- values: struct (nullable = true)
       |    |-- RecordOne: struct (nullable = true)
       |    |    |-- field_a: long (nullable = true)
       |    |    |-- field_b: string (nullable = true)
       |    |    |-- field_c: array (nullable = true)
       |    |    |    |-- element: long (containsNull = true)
       |    |-- RecordTwo: struct (nullable = true)
       |    |    |-- field_a: long (nullable = true)
      
      reality.txt
      root
       |-- ts: timestamp (nullable = true)
       |-- field_id: string (nullable = true)
       |-- values: struct (nullable = true)
       |    |-- member0: struct (nullable = true)
       |    |    |-- field_a: long (nullable = true)
       |    |    |-- field_b: string (nullable = true)
       |    |    |-- field_c: array (nullable = true)
       |    |    |    |-- element: long (containsNull = true)
       |    |-- member1: struct (nullable = true)
       |    |    |-- field_a: long (nullable = true)
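
      A possible workaround on the reading side, sketched under assumptions: rebuild the values struct and alias each positional field back to its record name. The helper names member_aliases and rename_union_members are hypothetical, and record_names must be supplied by the caller in the same order as the non-null branches of the Avro union. This does not change how Spark reads the file.

```python
def member_aliases(column, record_names):
    """Pair each positional union field with the record name it should carry."""
    return [(f"{column}.member{i}", name) for i, name in enumerate(record_names)]


def rename_union_members(df, column, record_names):
    """Return df with `column` rebuilt as a struct whose fields use record_names."""
    # Imported lazily so member_aliases stays usable without PySpark installed.
    from pyspark.sql import functions as F

    fields = [
        F.col(path).alias(name)
        for path, name in member_aliases(column, record_names)
    ]
    # withColumn replaces the existing column of the same name.
    return df.withColumn(column, F.struct(*fields))


print(member_aliases("values", ["RecordOne", "RecordTwo"]))
# [('values.member0', 'RecordOne'), ('values.member1', 'RecordTwo')]

# Usage, assuming df was loaded as in read_avro.py:
# df = rename_union_members(df, "values", ["RecordOne", "RecordTwo"])
# df.printSchema()
```

      The record names still have to come from the .avsc file or another source, since they are not present in the DataFrame schema.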
      

          People

            Assignee: Unassigned
            Reporter: Teddy Crepineau (TeddyCr)