SPARK-34416: Support avroSchemaUrl in addition to avroSchema


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 3.2.0
    • Fix Version/s: 3.2.0
    • Component/s: SQL
    • Labels: None

    Description

      We have a use case in which we read a huge table in Avro format, with about 30k columns.

      Using the default Hive reader, `AvroGenericRecordReader`, the job just hangs forever; after 4 hours not even one task had finished.

      We tried instead to use `spark.read.format("com.databricks.spark.avro").load(..)`, but it failed with:

      ```
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema
      ..
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:85)
      at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:67)
      at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:421)
      at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
      ... 53 elided
      ```

       

      because the files' schema contains duplicate column names (when compared case-insensitively).
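      For illustration, the check is case-insensitive by default (governed by `spark.sql.caseSensitive`), so two fields whose names differ only in case are enough to trigger it. A contrived sketch with made-up field names:

      ```scala
      import org.apache.spark.sql.types._

      // Two columns that differ only in case. With the default
      // spark.sql.caseSensitive=false, Spark's duplicate-column check
      // (SchemaUtils.checkColumnNameDuplication in the stack trace above)
      // treats these as the same name and raises
      // "Found duplicate column(s) in the data schema".
      val dataSchema = StructType(Seq(
        StructField("userId", StringType),
        StructField("userid", StringType)  // duplicate of "userId" when lower-cased
      ))
      ```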

      So we wanted to provide a user schema with the fields de-duplicated, but the schema is huge (a few MBs), so it is not practical to pass it inline as a JSON string.
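      A minimal sketch of the existing inline approach, assuming the built-in Avro source (`format("avro")`) and made-up paths and field names:

      ```scala
      // Assumes an existing SparkSession named `spark`.
      // Existing approach: pass the whole Avro schema inline via the `avroSchema`
      // option as a JSON string. With ~30k columns this string is several MBs,
      // which is what makes it impractical to embed in code or configuration.
      val avroSchemaJson =
        """{
          |  "type": "record",
          |  "name": "big_table",
          |  "fields": [
          |    {"name": "col_a", "type": ["null", "string"], "default": null},
          |    {"name": "col_b", "type": ["null", "long"], "default": null}
          |  ]
          |}""".stripMargin

      val df = spark.read
        .format("avro")
        .option("avroSchema", avroSchemaJson)  // inline JSON string
        .load("/path/to/huge/avro/table")
      ```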

       

      So we patched spark-avro to also accept an `avroSchemaUrl` option in addition to `avroSchema`, and it worked perfectly.
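      Usage with such an option would look roughly like this (a sketch; the option name follows the proposal, the paths are made up):

      ```scala
      // Assumes an existing SparkSession named `spark`.
      // Proposed option: point the reader at a .avsc file (e.g. on HDFS) instead
      // of inlining a multi-MB JSON schema string.
      val df = spark.read
        .format("avro")
        .option("avroSchemaUrl", "hdfs:///schemas/big_table_dedup.avsc")
        .load("/path/to/huge/avro/table")
      ```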



          People

            Assignee: Ohad Raviv (uzadude)
            Reporter: Ohad Raviv (uzadude)
            Votes: 0
            Watchers: 4
