[SPARK-24881] New options - compression and compressionLevel - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.3.1
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None

Description

Currently, the Avro datasource takes the compression codec name from the SQL config (the config key is hard-coded in AvroFileFormat): https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L121-L125 . The obvious drawback is that modifying the global config can affect multiple writes.

The purpose of this ticket is to add a new Avro option, "compression", the same as we already have for other datasources such as JSON and CSV. If the new option is not set by the user, we take the setting from the SQL config spark.sql.avro.compression.codec. If that is not set either, the default compression codec will be snappy (this is the current behavior in master).
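
The fallback order described above can be sketched as a small pure function. This is only an illustration of the proposed precedence, not Spark's implementation: the function name resolve_avro_codec is hypothetical, and plain dicts stand in for the write options and the session SQL config.

```python
from typing import Mapping, Optional

# Config key and default codec as named in this ticket.
SQL_CONF_KEY = "spark.sql.avro.compression.codec"
DEFAULT_CODEC = "snappy"


def resolve_avro_codec(
    write_options: Mapping[str, str],
    sql_conf: Mapping[str, str],
) -> str:
    """Pick the codec: per-write option, then SQL config, then the default.

    Both arguments are hypothetical stand-ins (plain dicts) for the
    per-write options and the global SQL config.
    """
    # 1. The per-write "compression" option wins if present.
    codec: Optional[str] = write_options.get("compression")
    # 2. Otherwise fall back to the global SQL config.
    if codec is None:
        codec = sql_conf.get(SQL_CONF_KEY)
    # 3. Otherwise use the default codec.
    if codec is None:
        codec = DEFAULT_CODEC
    return codec
```

Under this scheme a single write could presumably choose its own codec with something like df.write.format("avro").option("compression", "deflate").save(path), leaving the global config, and therefore other writes, untouched.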

Besides the compression option, we need to add another option, compressionLevel, which should reflect another SQL config in Avro: https://github.com/apache/spark/blob/106880edcd67bc20e8610a16f8ce6aa250268eeb/external/avro/src/main/scala/org/apache/spark/sql/avro/AvroFileFormat.scala#L122
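
A compressionLevel option could follow the same pattern. The sketch below rests on assumptions: the ticket only links to the SQL config without naming it, so the key used here (spark.sql.avro.deflate.level) is assumed for illustration, and resolve_compression_level is a hypothetical helper, not part of Spark.

```python
from typing import Mapping, Optional

# ASSUMPTION: the ticket links to this config but does not name it;
# the key below is used for illustration only.
ASSUMED_LEVEL_CONF_KEY = "spark.sql.avro.deflate.level"


def resolve_compression_level(
    write_options: Mapping[str, str],
    sql_conf: Mapping[str, str],
) -> Optional[int]:
    """Return the level from the per-write option, else from the SQL config.

    Returns None when neither is set, so the caller can leave the
    codec's own default level in place.
    """
    raw = write_options.get("compressionLevel")
    if raw is None:
        raw = sql_conf.get(ASSUMED_LEVEL_CONF_KEY)
    return int(raw) if raw is not None else None
```

Returning None rather than a hard-coded number keeps the sketch from guessing at a default level that the ticket does not specify.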