SPARK-42439

Job description in v2 FileWrites can have the wrong committer



    Description

      There is a difference in behavior between v1 writes and v2 writes in the order in which the file writer and the committer are configured.

      v1:

      1. writer.prepareWrite()
      2. committer.setupJob()

      v2:

      1. committer.setupJob()
      2. writer.prepareWrite()

       

      This is because the `prepareWrite()` call (the one that performs `job.setOutputFormatClass(classOf[ParquetOutputFormat[Row]])`) happens as part of `createWriteJobDescription`, which is a `lazy val` inside `toBatch` and is therefore only evaluated after the `committer.setupJob()` call at the end of `toBatch`.

      This causes issues when the committer is set up, because some configuration may still be missing. For example, the output format class mentioned above is not yet set, so the committer is configured for a generic write instead of a parquet write.
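      The ordering effect of the `lazy val` can be reproduced in isolation. The sketch below uses hypothetical stand-in methods (`prepareWrite`, `setupJob`, and a string `description` — not the real Spark v2 FileWrite classes) to show that the description's side effects only run after `setupJob()`:

```scala
import scala.collection.mutable.ArrayBuffer

// Minimal stand-ins for the writer and committer; the names are
// illustrative, not the actual Spark classes.
object LazyOrderingDemo {
  val events = ArrayBuffer.empty[String]

  // In Spark this is where job.setOutputFormatClass(...) would happen.
  def prepareWrite(): Unit = events += "prepareWrite"
  def setupJob(): Unit = events += "setupJob"

  // Mirrors the buggy v2 path: the job description is a lazy val,
  // so prepareWrite() is deferred...
  def toBatch(): Unit = {
    lazy val description: String = { prepareWrite(); "job description" }
    setupJob() // the committer is set up against a not-yet-configured job
    val forced = description // ...and only runs here, after setupJob()
  }
}
```

      Calling `LazyOrderingDemo.toBatch()` records `setupJob` before `prepareWrite`, which is exactly the inverted order described above.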

       

      The fix is very simple: make the `createWriteJobDescription` val non-lazy.
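      A sketch of why a strict val restores the v1 ordering (again using hypothetical stand-in methods, not the actual Spark classes):

```scala
import scala.collection.mutable.ArrayBuffer

// Illustrative stand-ins only; not the real Spark v2 FileWrite classes.
object StrictOrderingDemo {
  val events = ArrayBuffer.empty[String]

  def prepareWrite(): Unit = events += "prepareWrite"
  def setupJob(): Unit = events += "setupJob"

  // With a strict val, the description (and its prepareWrite() call)
  // is evaluated before the committer is set up, matching the v1 order.
  def toBatch(): Unit = {
    val description: String = { prepareWrite(); "job description" }
    setupJob() // the job is now fully configured when the committer sees it
  }
}
```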


          People

            Assignee: Unassigned
            Reporter: Lorenzo Martini
