[SPARK-24768] Have a built-in AVRO data source implementation - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.4.0
Fix Version/s: 2.4.0
Component/s: SQL
Labels: None

Description

Apache Avro (https://avro.apache.org) is a popular data serialization format. It is widely used in the Spark and Hadoop ecosystem, especially for Kafka-based data pipelines. With the external package https://github.com/databricks/spark-avro, Spark SQL can read and write Avro data. Making spark-avro built-in would provide a better experience for first-time users of Spark SQL and Structured Streaming, and we expect a built-in Avro data source to further improve the adoption of Structured Streaming. The proposal is to inline the code from the spark-avro package (https://github.com/databricks/spark-avro). The target release is Spark 2.4.

Attachments

Built-in AVRO Data Source In Spark 2.4.pdf
10/Jul/18 17:24
77 kB
Gengliang Wang

Issue Links

contains

SPARK-24741 Have a built-in AVRO data source implementation

Resolved

is duplicated by

SPARK-26062 Rename spark-avro external module to spark-sql-avro (to match spark-sql-kafka)

Closed

links to

[Github] Pull Request #21742 (gengliangwang)

[Github] Pull Request #21801 (ueshin)

[Github] Pull Request #21841 (gengliangwang)

[Github] Pull Request #21866 (gengliangwang)

(1 links to)

Sub-Tasks

1.	Add function `from_avro` and `to_avro`	Resolved	Gengliang Wang
2.	Upgrade AVRO version from 1.7.7 to 1.8.2	Resolved	Gengliang Wang
3.	support reading AVRO logical types - Date	Resolved	Gengliang Wang
4.	support reading AVRO logical types - Timestamp with different precisions	Resolved	Gengliang Wang
5.	support reading AVRO logical types - Decimal	Resolved	Gengliang Wang
6.	support reading AVRO logical types - Duration	Resolved	Unassigned
7.	AVRO unit test: use SQLTestUtils and Replace deprecated methods	Resolved	Gengliang Wang
8.	Add write benchmark for AVRO	Resolved	Gengliang Wang
9.	Add API `.avro` in DataFrameReader/DataFrameWriter	Resolved	Unassigned
10.	Refactor Avro Serializer and Deserializer	Resolved	Gengliang Wang
11.	Don't ignore files without .avro extension by default	Resolved	Max Gekk
12.	Fix paths to resource files in AvroSuite	Resolved	Max Gekk
13.	Support for parsing AVRO binary column	Resolved	Unassigned
14.	Supporting to convert a column into binary of AVRO format	Resolved	Unassigned
15.	New option - ignoreExtension	Resolved	Max Gekk
16.	Gather all options into AvroOptions	Resolved	Max Gekk
17.	Simplify schema serialization	Resolved	Gengliang Wang
18.	New options - compression and compressionLevel	Resolved	Max Gekk
19.	Remove implicit class AvroDataFrameWriter/AvroDataFrameReader	Resolved	Gengliang Wang
20.	Use SerializableConfiguration in Spark util	Resolved	Gengliang Wang
21.	Add mapping for built-in Avro data source	Resolved	Dongjoon Hyun
22.	Use internal.Logging instead for logging	Resolved	Hyukjin Kwon
23.	Avro: revise the output record namespace	Resolved	Gengliang Wang
24.	Generate Avro Binary files in test suite	Resolved	Gengliang Wang
25.	Validate user specified output schema	Resolved	Gengliang Wang
26.	Make the mapping of com.databricks.spark.avro to built-in module configurable	Resolved	Unassigned
27.	Documentation: AVRO data source guide	Resolved	Gengliang Wang
28.	Remove sql configuration spark.sql.avro.outputTimestampType	Resolved	Gengliang Wang
29.	Detect recursive reference in Avro schema and throw exception	Resolved	Gengliang Wang
30.	Support parse mode option for function `from_avro`	Resolved	Gengliang Wang
31.	Override method `prettyName` in `from_avro`/`to_avro`	Resolved	Gengliang Wang
32.	Add read benchmark for Avro	Resolved	Gengliang Wang
33.	Avro: Validate input and output schema	Resolved	Gengliang Wang
34.	Allow user-specified output schema in function `to_avro`	Resolved	Gengliang Wang
35.	Show Avro related API in documentation	Resolved	Gengliang Wang
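
Several of the sub-tasks above add read support for Avro logical types (Date, Timestamp with different precisions, Decimal). As a rough illustration of what a reader has to do for these types, the sketch below decodes them from their underlying primitive values as described in the Avro specification: `date` is an int of days since 1970-01-01, `timestamp-millis` and `timestamp-micros` are longs counted from the UTC epoch, and `decimal` is a big-endian two's-complement unscaled integer with the scale taken from the schema. This is plain Python for illustration only, not Spark's implementation; the helper names `decode_date`, `decode_timestamp`, and `decode_decimal` are hypothetical.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

EPOCH_DATE = date(1970, 1, 1)
EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_date(days: int) -> date:
    """Avro `date`: an int counting days since 1970-01-01."""
    return EPOCH_DATE + timedelta(days=days)


def decode_timestamp(value: int, logical_type: str) -> datetime:
    """Avro `timestamp-millis` / `timestamp-micros`: a long counted from the UTC epoch."""
    if logical_type == "timestamp-millis":
        return EPOCH_UTC + timedelta(milliseconds=value)
    if logical_type == "timestamp-micros":
        return EPOCH_UTC + timedelta(microseconds=value)
    raise ValueError(f"unsupported logical type: {logical_type}")


def decode_decimal(raw: bytes, scale: int) -> Decimal:
    """Avro `decimal`: big-endian two's-complement unscaled value; scale comes from the schema."""
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple keeps the value exact regardless of the decimal context.
    return Decimal((sign, digits, -scale))


if __name__ == "__main__":
    print(decode_date(17722))
    print(decode_timestamp(1000, "timestamp-millis"))
    print(decode_decimal((1234).to_bytes(2, "big", signed=True), scale=2))
```

The timestamp precision is selected by the logical type name in the schema, which is why a reader needs to handle both variants rather than assume one unit.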

People

Assignee: Gengliang Wang

Reporter: Gengliang Wang

Votes: 0

Watchers: 10

Dates

Created: 10/Jul/18 17:21

Updated: 09/Jun/19 23:35

Resolved: 17/Sep/18 05:25
