SPARK-34416: Support avroSchemaUrl in addition to avroSchema


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      We have a use case in which we read a huge table in Avro format, with about 30k columns.

      Using the default Hive reader, `AvroGenericRecordReader`, the job just hangs forever; after 4 hours not even one task had finished.

      We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:

      ```
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
      ..
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
      at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
      ... 53 elided
      ```

       

      because the files' schema contains duplicate column names (when compared case-insensitively).
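      For illustration, the check is case-insensitive by default (governed by `spark.sql.caseSensitive`), so two fields whose names differ only in case are enough to trigger it. A contrived sketch with made-up field names:

      ```scala
      import org.apache.spark.sql.types._

      // Two columns that differ only in case. With the default
      // spark.sql.caseSensitive=false, Spark's duplicate-column check
      // (SchemaUtils.checkColumnNameDuplication in the stack trace above)
      // treats these as the same name and raises
      // "Found duplicate column(s) in the data schema".
      val dataSchema = StructType(Seq(
        StructField("userId", StringType),
        StructField("userid", StringType)  // duplicate of "userId" when lower-cased
      ))
      ```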

      So we wanted to provide a user schema with the fields de-duplicated, but the schema is huge (a few MBs), so it is not practical to pass it inline as a JSON string.
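      A minimal sketch of the existing inline approach, assuming the built-in Avro source (`format("avro")`) and made-up paths and field names:

      ```scala
      // Assumes an existing SparkSession named `spark`.
      // Existing approach: pass the whole Avro schema inline via the `avroSchema`
      // option as a JSON string. With ~30k columns this string is several MBs,
      // which is what makes it impractical to embed in code or configuration.
      val avroSchemaJson =
        """{
          |  "type": "record",
          |  "name": "big_table",
          |  "fields": [
          |    {"name": "col_a", "type": ["null", "string"], "default": null},
          |    {"name": "col_b", "type": ["null", "long"], "default": null}
          |  ]
          |}""".stripMargin

      val df = spark.read
        .format("avro")
        .option("avroSchema", avroSchemaJson)  // inline JSON string
        .load("/path/to/huge/avro/table")
      ```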

       

      So we patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
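      Usage with such an option would look roughly like this (a sketch; the option name follows the proposal, the paths are made up):

      ```scala
      // Assumes an existing SparkSession named `spark`.
      // Proposed option: point the reader at a .avsc file (e.g. on HDFS) instead
      // of inlining a multi-MB JSON schema string.
      val df = spark.read
        .format("avro")
        .option("avroSchemaUrl", "hdfs:///schemas/big_table_dedup.avsc")
        .load("/path/to/huge/avro/table")
      ```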



          People

            Assignee: Ohad Raviv (uzadude)
            Reporter: Ohad Raviv (uzadude)
            Votes: 0
            Watchers: 4
