[SPARK-39279] Fasten the schema inference of CSV/JSON data source - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 3.4.0
Fix Version/s: None
Component/s: SQL
Labels: None

Description

In the current implementation of the CSV/JSON data sources, schema inference relies on methods that throw exceptions when a field cannot be converted to a given data type.

Throwing and catching exceptions can be slow. We can speed up inference by adding methods that return optional results instead of throwing. A good example is https://github.com/apache/spark/pull/36562, which reduces the schema inference time by 90%.
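
To illustrate the idea, here is a minimal sketch in Java. It is not Spark's code and not the approach taken in the linked PR; the class and method names (`InferenceSketch`, `isIntThrowing`, `tryParseInt`, `inferType`) are hypothetical, and it uses integer fields rather than timestamps to keep the example short. The first method probes a field by catching an exception; the second reports failure as an empty `OptionalInt`, so a non-matching field never triggers a throw.

```java
import java.util.OptionalInt;

final class InferenceSketch {
    // Exception-based check: every non-integer field pays for constructing,
    // throwing, and catching a NumberFormatException.
    static boolean isIntThrowing(String field) {
        try {
            Integer.parseInt(field);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Optional-based check: failure is an ordinary return value.
    // Assumption: only ASCII digits with an optional leading sign are accepted.
    static OptionalInt tryParseInt(String field) {
        if (field == null || field.isEmpty()) return OptionalInt.empty();
        char first = field.charAt(0);
        boolean negative = first == '-';
        int start = (negative || first == '+') ? 1 : 0;
        if (start == field.length()) return OptionalInt.empty(); // sign only
        long value = 0;
        for (int i = start; i < field.length(); i++) {
            char c = field.charAt(i);
            if (c < '0' || c > '9') return OptionalInt.empty();
            value = value * 10 + (c - '0');
            // Stop early once the magnitude exceeds |Integer.MIN_VALUE|.
            if (value > 2147483648L) return OptionalInt.empty();
        }
        if (negative) value = -value;
        if (value > Integer.MAX_VALUE) return OptionalInt.empty();
        return OptionalInt.of((int) value);
    }

    // Hypothetical inference step: pick a type name for one field.
    static String inferType(String field) {
        return tryParseInt(field).isPresent() ? "IntegerType" : "StringType";
    }
}
```

In a column where most values are not integers, `inferType` takes the empty-result path on every row instead of unwinding an exception each time, which is the kind of saving this ticket is after.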

Sub-Tasks

1.	Fasten Timestamp type inference of default format in JSON/CSV data source	Resolved	Gengliang Wang
2.	Speed up Timestamp type inference with user-provided format in JSON/CSV data source	Resolved	Jia Fan
3.	Speed up Timestamp type inference of legacy format in JSON/CSV data source	Resolved	Jia Fan
4.	Add benchmark for Timestamp type inference when use invalid value	Resolved	Jia Fan

People

Assignee: Unassigned

Reporter: Gengliang Wang

Votes: 0

Watchers: 1

Dates

Created: 25/May/22 02:48

Updated: 25/May/22 02:48