Spark / SPARK-27506

Function `from_avro` doesn't allow deserialization of data using other compatible schemas


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: SQL
    • Labels: None

    Description

       SPARK-24768 and its subtasks introduced support for reading and writing Avro data by parsing a binary column of Avro format and converting it into its corresponding Catalyst value (and vice versa).

       

      The current implementation requires an event to be deserialized with the exact same schema with which it was serialized. This breaks one of Avro's most important features, schema evolution (https://docs.confluent.io/current/schema-registry/avro.html) - most importantly, the ability to read old data with a newer (compatible) schema without breaking the consumer.
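To make the schema-evolution concern concrete, here is a minimal sketch of the resolution rule Avro applies when the reader's schema has evolved past the writer's: a reader field missing from the writer's schema is filled from its default, and fails if no default exists. This is plain Python with no Avro dependency; `resolve_record`, the schema dicts, and the record are all simplified, hypothetical stand-ins for real Avro schemas.

```python
# Simplified stand-ins for Avro record schemas: each field maps to a
# dict that may carry a "default" value (as a real Avro schema can).
WRITER_SCHEMA_V1 = {"id": {}}                                  # old producer schema
READER_SCHEMA_V2 = {"id": {}, "name": {"default": "unknown"}}  # evolved consumer schema

def resolve_record(record, writer_schema, reader_schema):
    """Resolve a record written with writer_schema against reader_schema,
    mimicking (in a very simplified form) Avro's schema-resolution rules."""
    resolved = {}
    for field, spec in reader_schema.items():
        if field in writer_schema:
            resolved[field] = record[field]      # present in the old data
        elif "default" in spec:
            resolved[field] = spec["default"]    # new field: fall back to its default
        else:
            raise ValueError(f"field '{field}' has no value and no default")
    return resolved

# An "old" record, produced under the v1 schema, read with the v2 schema:
old_record = {"id": 42}
print(resolve_record(old_record, WRITER_SCHEMA_V1, READER_SCHEMA_V2))
# -> {'id': 42, 'name': 'unknown'}
```

Requiring the exact same schema on both sides, as `from_avro` did at the time, makes this kind of compatible read impossible.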

       

      The GenericDatumReader in the Avro library already supports passing an optional writer's schema (the schema with which the record was serialized) alongside the mandatory reader's schema (the schema with which the record is to be deserialized). The proposed change is to do the same in the from_avro function, allowing an optional writer's schema to be passed in and used during deserialization.
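The proposed behavior can be sketched as follows. This is not the actual Spark API: the function name merely mirrors `from_avro`, and the signature, the JSON-string schemas, and the use of JSON in place of Avro binary encoding are all illustrative assumptions. The key point is the shape of the contract: an optional writer's schema that defaults to the reader's schema when omitted, just as `GenericDatumReader` allows.

```python
import json
from typing import Optional

def _writer_field_names(schema: dict) -> set:
    """Names of the fields present in the writer's (hypothetical) schema."""
    return {f["name"] for f in schema["fields"]}

def from_avro_sketch(payload: bytes, reader_schema_json: str,
                     writer_schema_json: Optional[str] = None) -> dict:
    """Hypothetical from_avro: decode `payload` (JSON here, standing in for
    Avro binary) written under the writer's schema into the reader's schema.
    When no writer's schema is given, assume it equals the reader's schema,
    which is exactly the pre-change behavior the issue describes."""
    reader_schema = json.loads(reader_schema_json)
    writer_schema = (json.loads(writer_schema_json)
                     if writer_schema_json is not None else reader_schema)
    record = json.loads(payload)
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in _writer_field_names(writer_schema):
            out[name] = record[name]             # field existed when written
        elif "default" in field:
            out[name] = field["default"]         # evolved field: use default
        else:
            raise ValueError(f"no value and no default for field '{name}'")
    return out

# Old data written with schema v1; read with the evolved schema v2:
v1 = json.dumps({"type": "record", "name": "Event",
                 "fields": [{"name": "id", "type": "long"}]})
v2 = json.dumps({"type": "record", "name": "Event",
                 "fields": [{"name": "id", "type": "long"},
                            {"name": "source", "type": "string",
                             "default": "n/a"}]})
payload = json.dumps({"id": 7}).encode()
print(from_avro_sketch(payload, v2, writer_schema_json=v1))
# -> {'id': 7, 'source': 'n/a'}
```

Keeping the writer's schema optional preserves backward compatibility: existing callers that pass only one schema see no change in behavior.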

      Attachments

        Activity


          People

            Fokko Driesprong (fokko)
            Gianluca Amori (giamo)
            Gengliang Wang
            Votes: 1
            Watchers: 4

